SPECIAL EDITION: Q3 2023 Best of
It’s time for another “best of” issue! This collection looks back through our summer months at our most popular articles. Hope you enjoy it!
This issue is sponsored by:
Alerting is evolving. Signals is coming soon.
This winter from incident management platform FireHydrant: alerting and incident response in one ring-to-retro tool for the first time. Sign up for the early access waitlist and be the first to experience the power of alerting + incident response in one platform — at last.
Articles & News on monitoring.love
Observability & Monitoring Community Slack
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
Alerting: The Do’s and Don’ts for Effective Observability
This post is sincerely interesting; it starts off almost as a chapter from a novel before pivoting hard into thoughtful considerations for crafting effective alerts.
Why are Prometheus queries hard?
Maybe I’m biased because I’ve used other time series query languages (Graphite, Librato, etc.) for many years before Prometheus came along, but I agree… PromQL can be a hassle to master. This post explains why it can feel that way and introduces a new open source project to help make it easier.
Into the Heart of Darkness: Sofia’s Adventure in the Enigmatic Logging Forest
I genuinely can’t tell if this is fan fiction, developer advocacy, or an SRE biopic. Either way, it’s an interesting read.
How Observability Changed My (Developer) Life
A genuine look at observability and its impact on our work from the perspective of a web developer.
Prometheus and Thanos: An Ultimate Alliance for Scalable Metrics
An overview of Thanos, its components, and how it complements Prometheus when horizontal scaling becomes a necessity.
How Monitorama changed our lives — a decade on
A reflection on the early days of monitoring and whether anomaly detection has really gotten us anywhere (my words, not theirs).
The Problem with Timeseries Data in Machine Learning Feature Systems
This might be a bit of a niche concern for our audience, but if you happen to be applying machine learning to your time series data, you’ll probably appreciate reading how Etsy stumbled across some potential issues.
Best practices for avoiding race conditions in inhibition rules
Inhibit rules are an important aspect of alerting but can have unexpected behavior if you don’t fully understand how to configure them properly. If you’re alerting with Prometheus and Alertmanager, you should definitely read this post.
Alertmanager’s Group wait, Group interval and Repeat interval explained
I’ve really enjoyed George Robinson’s articles on Alertmanager use and [somewhat undocumented] behaviors. Here is the last one I found published on his blog, looking at some internal timers and their effects on Alertmanager behavior.
Kubernetes Monitoring: Ensuring Performance and Stability in Containerized Environments
This appears intended to serve as an exhaustive overview of Kubernetes observability; it does a decent job touching on all of the related topics, but you’ll want to perform deeper research on any specific area. Frankly, if you just grabbed all of the section titles they would make a great checklist for your manager. 😜
From Chaos to Recovery: How We Restored Our AWS Microservice After Accidental Deletion at Dolap
We’ve all been there… that moment of realization that you just did something very, very wrong and there’s no way to take it back (in my case, an errant rm -rf / at an OpenBSD hackathon). Still, this is how we learn from our mistakes and build more resiliency into our systems.
Kubernetes logging best practices
Honestly, the title says it all. Although most of the best practices apply to logging in general, it’s still a good review for anyone using or maintaining logging infrastructure in Kubernetes.
You’re Paying too much for (Cloudwatch) Logs
This post speaks to the trade-offs we face with our technology choices. More specifically, it compares logging costs between Cloudwatch, Datadog, and a “custom” solution using AWS components.
Grafana Loki: performance optimization with Recording Rules, caching, and parallel queries
Some excellent tips on Loki performance gained from real-world use and frustration. Reminds me of my old Graphite Tips blog posts.
BPFAgent: eBPF for Monitoring at DoorDash
A detailed look at how DoorDash engineers have iterated on their eBPF agent and probes and where this has paid off in terms of debugging, observability, and for validating system migrations.
Tech Blog: Create Meaningful Logging
We take it for granted that engineers are born knowing how and what to log. This article reminded me that’s not the case, and does a good job covering the reasons we should log, along with examples for developers to apply to their own applications.
Loki seems to be gaining a lot of mindshare in the logging space. Here’s a quick post demonstrating one pattern for storing logs using its API.
Like many of you, I appreciate OpenTelemetry for its capabilities, but even more so I love it for its ability to protect us from vendor lock-in. This might be its one true killer feature.
How to Extract the Maximum Value From Logs
Now that you’ve got your developers emitting logs on the regular, what next? There’s a lot to stay on top of when managing log aggregation at scale, and this post does a good job listing off a bunch of the considerations.
From Blind Spots to Clear Insights: The Evolution of Observability Tools and Practices at Greenlight
How one fintech company has leaned into Observability through a combination of bespoke in-house tooling and commercial vendors.
What Is “Production-Grade” Software?
As an EM with a team of product developers, this one hits close to home. I expect most of the folks here are strong advocates of these principles, but it might be helpful to share this post with your peers.
A basic walkthrough for setting up Loki with Promtail and Grafana.
What happened to Vivaldi Social?
An entertaining (for readers, anyways) postmortem of the Mastodon service run by Vivaldi. It’s almost always a good learning experience to understand how other admins respond to a service outage.
See you next week!
– Jason (@obfuscurity) Monitoring Weekly Editor