Issue 210

Some fun and interesting articles this week. Loving the emphasis on practical monitoring and troubleshooting practices. And flame graphs! 🔥📈🔔

This issue is sponsored by:

FireHydrant

Regardless of where you are on your incident management maturity journey, there’s a right next step you can take. Learn about three areas of focus — roles, services, and retros — why they’re important, and how to improve at any level in "3 ways to improve your incident management program in 2023."

Articles & News on monitoring.love

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

Improving Istio Propagation Delay

I absolutely love debugging stories like this one from Airbnb working through some Istio performance issues in production. 🤩

OpenTelemetry and the future of monitoring and observability

Probably one of the more honest appraisals of OpenTelemetry’s strengths and areas for improvement.

Collector Data Flow Dashboard

This engineer not only created a Grafana dashboard for monitoring the OpenTelemetry collector, but also wrote up a page with diagrams explaining the flow and respective metrics. Great work!

Scaling Observability Reliably and Frugally at Magicpin

One company’s journey planning a new Observability platform, working through the usual considerations: build versus buy, open source versus commercial, and how to make sure it scales with their growth.

From Unstructured Logs to Observability

Unstructured logs continue to be a pillar of Observability for many companies, but they can be so much more. This post shows how a bit of planning can yield richer data using structured events.
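
For anyone who wants a feel for the idea before clicking through, here is a minimal sketch of a structured event logger in Python. It is not taken from the linked post; the logger name `checkout`, the `fields` key, and the event field names are all made up for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical schema)."""

    def format(self, record):
        event = {
            "ts": record.created,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes a top-level key,
        # so it can be queried directly instead of parsed out of a string.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event, sort_keys=True)


def build_logger(stream):
    """Return a logger that writes one JSON event per line to `stream`."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Calling `build_logger(sys.stdout).info("payment failed", extra={"fields": {"user_id": 42, "latency_ms": 187}})` would emit a single JSON line where `user_id` and `latency_ms` are their own keys rather than text buried in the message.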

Collecting job metrics using Prometheus PushGateway

A look at one of the more common use cases for push metrics in a Prometheus-monitored architecture.
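
As a rough illustration of what pushing a job metric involves (a sketch only, not the linked article's code), the snippet below renders a gauge in the Prometheus text exposition format and sends it to a Pushgateway with an HTTP PUT. The gateway address, job name, and metric name are hypothetical, and the job name is assumed to be URL-safe.

```python
import urllib.request


def render_gauge(name, value, help_text):
    """Render one gauge in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )


def push_url(gateway, job):
    # The Pushgateway groups pushed metrics by the job name in the URL path.
    return f"{gateway.rstrip('/')}/metrics/job/{job}"


def push(gateway, job, body):
    """PUT the metrics body to the gateway, replacing that job's metric group."""
    request = urllib.request.Request(
        push_url(gateway, job), data=body.encode("utf-8"), method="PUT"
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

A batch job could finish with something like `push("http://localhost:9091", "nightly_backup", render_gauge("backup_last_success_timestamp_seconds", time.time(), "Time of last successful backup"))`, assuming a gateway is listening at that address. In practice the official `prometheus_client` library handles these details for you.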

Alerts history in Prometheus

I think most folks here are probably aware of this feature, but it’s a great thing to share with anyone who might be newer to Prometheus and Alertmanager.

AWS Troubleshooting: The Art of Finding the Needle in the Haystack

A broad set of practices for troubleshooting AWS services. Most of this is probably a refresher, but it’s a good reminder to make sure you have your processes and strategies up to date.

Grafana security release: Fixes for CVE-2023-1410

A medium severity security fix affecting the use of Grafana’s Graphite data source.

Kubernetes CreateContainerConfigError and CreateContainerError

The folks at Sysdig take a closer look at a couple of common Kubernetes container errors and how you can monitor them effectively.

Tools

rubyatscale/singed

“Singed makes it easy to get a flamegraph anywhere in your code base.”

Events

Monitorama 2023 PDX - Agenda

Monitorama has announced their full agenda for this year’s event. Looks like an awesome collection of topics and speakers. Hope to see you there!

See you next week!

– Jason (@obfuscurity) Monitoring Weekly Editor