SPECIAL EDITION: Q2 2024 Best of

With the most recent Monitorama behind us, it feels like a great time for our quarterly “best of” issue! We have some fantastic articles here covering the most popular topics and themes from the past few months. Enjoy!

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Making sense of Grafana Dashboards

This article was written for anyone who’s installed Grafana, hooked up some exporters, and been left scratching their head wondering “what now?”. Great job deconstructing some of the parts that most of us take for granted or had to piece together from a bunch of reference docs.

Demystifying Observability 2.0

A look back at some of the possible missteps of “Observability 1.0” and a new framing for our favorite technology domain. This post carries even more weight if you’re already familiar with OpenTelemetry.

Remote power monitoring

In any other setting this would be just another friendly guide for monitoring your home systems with open source software. Under the fog of war you quickly appreciate the impact of even a small project like this one.

All you need is Wide Events, not “Metrics, Logs and Traces”

Probably one of the best articles I’ve read in a while, managing to glue together the Observability pillars (and more) in a way that clears up a lot of confusion and angst within the tech community.

Transitioning to OpenTelemetry

One of the best posts I’ve read on adopting OpenTelemetry. I especially appreciate the honest comparisons between logging and tracing.

The Problem with OpenTelemetry

A part of me empathizes with this complaint, but I also feel like it’s unrealistic to expect something as vendor-agnostic as OpenTelemetry to also be optimized for every possible vendor use case.

CLP on JSON: High Compression and Fast Search on Dynamically-Structured Logs

An impressively deep dive into CLP-JSON, an extension to the Compressed Log Processor (CLP) that adds lossless compression of dynamically structured (JSON) logs.

Deconstructing Retina

Somehow I completely missed the announcement from Microsoft open sourcing the Retina project. This post offers a super quick look at the project and where to find more details.

Stop paying for luxury monitoring

Making the case that synthetic monitors are good enough for smaller businesses that may not have the budget for a commercial observability product. Woof.

Adopting OpenTelemetry for our logging pipeline

Cloudflare engineers have a history of sharing their monitoring architectures and challenges. This marks another great entry in their series of behind-the-scenes posts, this time focusing on OpenTelemetry log collection. If this interests you, take note of the job posting further down in this issue.

Learned it the hard way: Don’t use Cilium’s default Pod CIDR

Funny how network misconfigurations always feel obvious in hindsight. Excellent debugging story, props to the author for sharing.

Don’t Get Lost in the Metrics Maze: A Practical Guide to SLOs, SLIs, Error Budgets and Toil

An overview of service level concepts with some practical examples for error budgets.
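
If the error budget arithmetic is new to you, here is a minimal sketch of the idea in Python. It is not taken from the article; the function name, fields, and sample numbers are all made up for illustration, and it assumes a simple request-based SLI (good requests divided by total requests).

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize a request-based error budget for one SLO window.

    slo_target is a fraction such as 0.999 (i.e. "three nines").
    All names here are hypothetical, chosen only for this sketch.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # The budget is the share of requests that are allowed to fail.
    allowed_failures = (1 - slo_target) * total_requests
    return {
        "sli": (total_requests - failed_requests) / total_requests,
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "consumed_fraction": failed_requests / allowed_failures,
    }


if __name__ == "__main__":
    # Sample numbers only: a 99.9% target over one million requests.
    summary = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    for key, value in summary.items():
        print(f"{key}: {value:.4f}")
```

A consumed fraction above 1.0 means the budget for the window is spent, which is usually the signal to slow down risky changes.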

Virtualizing Our Storage Engine

Another look at query performance improvements, this time related to changes within Honeycomb’s internal storage service.

Observability for The Absolute Beginner

A primer for anyone new to Observability concepts with some just-below-the-surface-level tips and cautions for each pillar technology.

How to write useful logs

A quick collection of logging best practices and tips.

Loki 3.0 release

Hard to believe it’s already been five years since Loki’s original announcement, but here we are celebrating Loki’s third major release. I think the biggest improvement is probably native OpenTelemetry support (Bloom filters are nice, but imho an implementation detail that users shouldn’t have to care about).

The Need for Structured Logging: Introducing the Hiver Python Logging Package

An exploration of Hiver’s need for a custom logging library to support their developers while addressing many of the shortcomings of traditional logging libraries and formats.
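
For a feel of what structured logging looks like in practice, here is a rough sketch using only Python's standard library. To be clear, this is not Hiver's package or its API; the `JsonFormatter` class, the `build_logger` helper, and the `context` key are hypothetical names invented for this example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge caller-supplied fields passed via extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to stream (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("example")
    log.info("ticket assigned", extra={"context": {"ticket_id": 42, "assignee": "sam"}})
```

Because every field is a key in the JSON object, a log backend can filter on `ticket_id` directly instead of parsing it out of a free-form message.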

Minimizing on-call burnout through alerts observability

A look at how Cloudflare engineers maintain a healthy environment for on-call engineers through observability and analysis of their monitoring systems. I really enjoy seeing “behind the curtain” at other companies that have to deal with alerting noise at scale.

Best Practices for Using DORA Metrics to Improve Software Delivery

If you’re not already using DORA metrics (why not?), you should check out this primer from Datadog. An informative post, even if you’re not using their product.

Scaling the Grafana Observability Stack

This one reads less like a guide and more like a collection of notes from a successful migration to the LGTM stack, but it’s still a good jumping-off point for anyone considering this route.

Event Collector: Your first Rust Application

I love this example for writing your own event collector. It’s less of a practical application than a learning opportunity for coding in Rust, but what a fun way to hone your skills.

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor