Issue 223
Had a great time last week at Monitorama watching all the talks and seeing so many familiar faces. Feels serendipitous to come back and discover so many stories this week about production incidents, learning from our mistakes, and more. Enjoy! 🌞🍹📈
This issue is sponsored by:
Can you rely on your deployments?
In a recent Armory and Gartner report, 35% of respondents named reliability and consistency as their top pain point with app deployment. If you need help with consistent, reliable deployments, try Armory Continuous Deployment-as-a-Service. Check out more in the reports here.
Articles & News on monitoring.love
Observability & Monitoring Community Slack
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
How Observability Changed My (Developer) Life
A genuine look at observability and its impact on our work from the perspective of a web developer.
Alerting: The Do’s and Don’ts for Effective Observability
This post is sincerely interesting; it starts off almost as a chapter from a novel before pivoting hard into thoughtful considerations for crafting effective alerts.
Developing a data driven tool to estimate the cost of incidents
Most companies I’ve seen struggle with quantifying the impact of an outage. Props to this HelloFresh engineer for sharing how they model incidents and derive actionable insights.
From Chaos to Recovery: How We Restored Our AWS Microservice After Accidental Deletion at Dolap
We’ve all been there… that moment of realization that you just did something very, very wrong and there’s no way to take it back (in my case, an errant rm -rf / at an OpenBSD hackathon). Still, this is how we learn from our mistakes and build more resiliency into our systems.
Why are Prometheus queries hard?
Maybe I’m biased because I used other time series query languages (Graphite, Librato, etc.) for many years before Prometheus came along, but I agree… PromQL can be a hassle to master. This post explains why it can feel that way and introduces a new open source project to help make it easier.
The Problem with Timeseries Data in Machine Learning Feature Systems
This might be a bit of a niche concern for our audience, but if you happen to be applying machine learning to your time series data, you’ll probably appreciate reading how Etsy stumbled across some potential issues.
Preach, we should always strive to learn from (and avoid recurrences of) our mistakes.
How to run faster Loki metric queries with more accurate results
Some handy tips (and explanations for why they matter) for improving your Loki queries.
Activating Automatical Performance Analysis – Continuous Profiling
A first look at SkyWalking’s new continuous profiling capabilities.
Tools
https://github.com/autometrics-dev
“Autometrics uses instrumented function names to generate Prometheus queries so you don’t need to hand-write complicated PromQL.”
Job Opportunities
Software Engineer, Site Reliability at Redpanda Data (NA Remote)
Senior Staff Site Reliability Engineer at SentinelOne (US Remote)
See you next week!
– Jason (@obfuscurity) Monitoring Weekly Editor