SPECIAL EDITION: Q1 2019 Best of
This issue is sponsored by:
Monitoring & Remediation - Know What You Must, When You Need It, & With the Alerts You Require.
Managing a hybrid network, with its growing complexity? For one competitive per-instance price, Panopta offers the peace of mind of 360-degree coverage - from bare metal to virtual machines, from your on-prem servers to the furthest reaches of the cloud. Incident remediation available too!
Latest on monitoring.love
Real World DevOps - The Science Behind DevOps with Dr. Nicole Forsgren
I had the pleasure of interviewing Nicole Forsgren about the science behind DevOps and her work with the annual State of DevOps Report. There’s some fascinating stuff in here, including her top three favorite takeaways. Also, don’t forget to take the 2019 survey!
Real World DevOps - Database Performance With A Side of Empathy with Baron Schwartz
Baron Schwartz joins me for a delightful conversation about his technique for writing books and his thoughts on culture change, wrapping up with a very real conversation about bias, privilege, and empathy that you won’t want to miss.
Observability & Monitoring Community Slack
Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.
From The Community
9 Logging Best Practices Based on Hands-on Experience
There are some great tips in here, but perhaps my favorite is the implication that you don’t always want your logs in a structured format if your goal is human consumption of them. As good as I am at computers, I can’t read nested JSON easily, you know.
Because outages can also be beautiful.
I freaking love this thread from Sarah Mei about on-call. One of the takeaways is something we’re finally starting to see a little bit of movement on: on-call should be paid, above and beyond your standard pay, whether you’re paged or not. There are a whole lot of other great points in the thread, so I recommend clicking through and reading the whole thing.
Super neat visualization for understanding communication patterns in distributed systems, but if this doesn’t make you start regretting breaking up your monolith, I don’t know what will.
Logs vs Structured Events by Charity Majors
Based on the title, this sounds like just another “structure your logs, m’kay” article but it’s even better than that: why your teammates might be against structured logging and how to convince them. I particularly like the observations on philosophy of logging for monolithic applications vs distributed applications.
SRE Observability: Metric Namespaces and Structures
Namespacing metrics is hard, yo.
csabapalfi/awesome-web-performance-metrics: List of awesome web performance
A whole bunch of web performance metrics (and what they mean) and tools for collecting+analyzing them.
If you’re new to Prometheus’s query language, PromQL, this is a super handy guide.
This issue is sponsored by:
Everyone wants stats-based alerting, but it’s not always straightforward to do. InfluxDB’s Holt-Winters support is pretty great though, and easy to use. Learn more about it here.
My thanks to InfluxData for their support of Monitoring Weekly.
Three Pillars, Zero Answers: We Need to Rethink Observability
Could it be that the industry’s fascination with the “three pillars of observability” is incorrect, misguided, or at least incomplete? Ben Sigelman, co-founder of LightStep, makes a damn good argument that we’re focused on the wrong thing.
Google’s Site Reliability Workbook now available for free in HTML format
Observability?! – Where do we go from here?
This article hits on a wide range of emerging trends and challenges in observability.
I’m John Allspaw, Ask Me Anything about incident analysis and postmortems
For those that know of John Allspaw, you’re probably already clicking this link hard. For those that don’t know him and his work, he’s an expert in incident analysis, post-mortems, and human factors/systems safety. Also, you should be clicking this link hard and gorging yourself on the incredibly helpful stuff in here.
RUM vs. APM: How They’re Similar and Different
This is an interesting take on things: RUM is more of a technique or way of monitoring a specific thing, whereas APM is a much broader category that encompasses RUM.
Not all metrics are infrastructure-related or deep in the code–many of the most important ones are higher-level. This article talks about what makes a good (business-level) metric.
Linux Kernel Observability through eBPF
There are two kinds of people in the world: those who love eBPF and those who haven’t used it yet.
How Much Should My Observability Stack Cost?
I’ll spoil it for you: you’re spending more than you think and that’s probably still not enough.
This issue is sponsored by:
20 Patterns to Watch for in Engineering Teams
GitPrime’s new book draws together some of the most common software team dynamics, observed in working with hundreds of enterprise engineering organizations. Actionable insights to help you debug your development process with data. Get Your Copy.
Measuring Wikipedia page load times
Frontend monitoring doesn’t get enough love, in my opinion, so be sure to read this article and enjoy it–it’s quite useful.
How Dashboards are Changing Human Behavior in DevOps
Dashboards get a lot of flak these days, but I think it’s telling that the people throwing shade at the concept of “I have dashboards to tell me things” are also those working in very advanced, technically-mature, small environments. The truth is that dashboards are an incredibly valuable asset and, as this article points out, they helped IBM start tearing down silos. Dashboards are great, y’all.
How We Built an Automated Anomaly Detection System onto a Streaming Pipeline
A look under the hood of some interesting Salesforce engineering.
Scaling up reporting on high-cardinality metrics
For those of you working on high-volume backend systems, you’ll like this article from the folks at Segment.
I love a good monster post and this one certainly hits home. I especially love the focus on mental models.
Six Simple Steps to Service Level Objectives (SLOs)
“Marie Cosgrove-Davies covers a user-focused approach to SLOs and some common pitfalls that teams encounter when they’re first trying to adopt SLO methods.”
Performance monitoring with OpenTracing, OpenCensus, and OpenMetrics
If you were starting to get confused by the silly ‘OpenWhatever’ naming patterns, this article from the folks at Datadog does a great job of explaining what the hell is going on.
Pro Tips: How Booking.com Handles Millions of Metrics Per Second with Graphite
From the article: “Over the years, Booking.com’s Graphite grew to consist of hundreds of servers.” … “It ingests more than 10 million unique points per second.” Holy crap.
This issue is sponsored by:
On-call can seriously suck - but it doesn’t have to! Learn how to set up better on-call scheduling, improve your alerting strategy, and more with this ebook from Raygun. Grab a copy of it right here.
See you next week!
– Mike (@mike_julian) Monitoring Weekly Editor