Issue 138
Some common themes this week around on-call, incident response, IoT sensor collection, and Java application monitoring. Buckle up! 📡🌈📈
This issue is sponsored by:
Start incident response with context to all your alerts in one view
Moogsoft speeds up incident response with dynamic anomaly detection, suppressed alert noise, and correlated insights across all your telemetry data. Go from debugging across multiple tools, screens, and dashboards into a single incident view so you and your teams can take a more proactive approach to reduce MTTR. Sign up for the Moogsoft Free community plan today!
Articles & News on monitoring.love
Observability & Monitoring Community Slack
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
Groot: eBay’s Event-graph-based Approach for Root Cause Analysis
Whether or not you believe the “single root cause” exists, eBay’s Groot event-graph-based approach to RCA demonstrates some extremely impressive numbers for their causality graphs. The whitepaper on Groot’s design (in partnership with University of Illinois Urbana-Champaign and Peking University) can be found here.
Maintenance windows are a mistake
I have a lot of conflicting feels on this one. Yes, I agree with the author’s take, but I also recognize that not everyone has the resources or freedom to prioritize High Availability for their entire architecture. Here’s a terrible thought… are you better off taking a service down for maintenance without notifying your customers?
First time SRE, on call discovering
This article reminds me of all the insecurities and chaos I felt joining an on-call rotation for the first time. Fortunately, the author provides some key takeaways and tips for preparing for your own encounter with the pager.
Custom Prometheus Metrics with Go
A handy article for developing your own custom Prometheus exporter in Go. The code examples use a commercial weather API service, but you can grok the pattern without using that resource.
SLI’s and SLO’s, how to wrap your head around it and actually use them to calculate availability
Most of us have a passing understanding of SLIs, SLOs, and how they feed into SLAs. Unfortunately, many of us still struggle with the question of how to leverage them for availability numbers and error budgets. This post aims to answer that question for us.
Best Practices for Writing Incident Postmortems
Incident response can be a chaotic experience for everyone. This post from Datadog highlights some best practices for collecting your data in preparation for writing the postmortem.
The IR Mindset (Part 2: Practical Approach)
A more hands-on (and vendor agnostic) approach to organizing your incident response information and data.
Chronosphere is the only observability platform that puts you back in control by taming rampant data growth and cloud-native complexity, delivering increased business confidence. Teams at enterprises, large cloud-native, and mid-market companies around the world trust Chronosphere to help them operate scalable, highly available, and resilient applications. Learn more at https://chronosphere.io. (SPONSORED)
How to visualize real-time data from an IoT smart home weather station with Grafana dashboards
I never get tired of seeing all of the creative ways that folks use Grafana for their personal use. I’ve never considered running my own weather station before (narrator: he might), but it’s nice to know I could self-serve my IoT data collection and visualization.
Zabbix External checks by example
Speaking of pulling external sensor data, this might be the first example I’ve seen for doing this with Zabbix.
How to set up monitoring tools for Java application
I don’t work with Java, but I’m glad to see that their ecosystem plays nicely with tools like Prometheus and Grafana. Some good examples here for profiling your Java bits with Java Flight Recorder and JDK Mission Control.
Distributed Tracing with Spring Cloud Jaeger
A simple walkthrough for tracing your Spring Java application with Jaeger.
Async stack traces in folly: Introduction
I’m not a fan of Facebook, but there’s no denying they have some brilliant engineers over there. Async stack traces seem like one of those things that are simple in concept, but highly complicated (and confusing af) in practice. Maybe it’s just me. 😆
Tools
“A WeatherFlow data collector for local-UDP, remote-socket, and remote-rest APIs. Feeds InfluxDB and Grafana Loki back-ends. Includes current conditions, forecasts, and historical details.”
Job Opportunities
Senior DevOps Engineer at LeafLink (US, NYC)
Site Reliability Engineer at Classkick (Remote)
Senior Cloud Engineer at Recurly (Remote)
Negotiating your AWS contract? Let us help. At The Duckbill Group, we’re on your side and we see dozens of these a year–more than most AWS account managers! We’ve helped negotiate everything from $3mm contracts to $650mm contracts and a whole slew in between. Check out our AWS contract negotiation services. (SPONSORED)
See you next week!
– Jason (@obfuscurity) Monitoring Weekly Editor