SPECIAL EDITION: Q1 2019 Best of

This issue is sponsored by:

Panopta: Monitoring & Remediation - Know What You Must, When You Need It, & With the Alerts You Require.

Managing a hybrid network, with its growing complexity? For 1 competitive per-instance price Panopta offers the peace of mind of 360 degree coverage - from bare metal to virtual machines, from your on-prem servers to the furthest reaches of the cloud. Incident remediation available too!

Latest on monitoring.love

Real World DevOps - The Science Behind DevOps with Dr. Nicole Forsgren

I had the pleasure of interviewing Nicole Forsgren about the science behind DevOps and her work with the annual State of DevOps Report. There’s some fascinating stuff in here, including her top three favorite takeaways. Also, don’t forget to take the 2019 survey!

Real World DevOps - Database Performance With A Side of Empathy with Baron Schwartz

Baron Schwartz joins me for a delightful conversation about his technique for writing books, his thoughts on culture change, and wrapping up with a very real conversation about bias, privilege, and empathy that you won’t want to miss.

Observability & Monitoring Community Slack

Come hang out with all your fellow Monitoring Weekly readers. I mean, I’m also there, but I’m sure everyone else is way cooler.

From The Community

9 Logging Best Practices Based on Hands-on Experience

There are some great tips in here, but perhaps my favorite is the implication that you don’t always want your logs in a structured format if your goal is human consumption of them. As good as I am at computers, I can’t read nested JSON easily, you know.

phenomenal outages

Because outages can also be beautiful.

Sarah Mei on Twitter: “My fundamental issue with being on call is that I care more about my personal life & health than I do about whether my employer’s website is operational.”

I freaking love this thread from Sarah Mei on on-call. One of the takeaways is something we’re finally starting to see a little bit of movement on: on-call should be paid above and beyond your standard pay, whether you’re paged or not. There are a whole lot of other great points in the thread, so I recommend clicking through and reading the whole thing.

ShiViz

Super neat visualization for understanding communication patterns in distributed systems, but if this doesn’t make you start regretting breaking up your monolith, I don’t know what will.

Logs vs Structured Events by Charity Majors

Based on the title, this sounds like just another “structure your logs, m’kay” article but it’s even better than that: why your teammates might be against structured logging and how to convince them. I particularly like the observations on philosophy of logging for monolithic applications vs distributed applications.

SRE Observability: Metric Namespaces and Structures

Namespacing metrics is hard, yo.

csabapalfi/awesome-web-performance-metrics: List of awesome web performance

A whole bunch of web performance metrics (and what they mean) and tools for collecting+analyzing them.

PromQL tutorial for beginners

If you’re new to Prometheus’s query language, PromQL, this is a super handy guide.

This issue is sponsored by:

InfluxData: Everyone wants stats-based alerting, but it’s not always straightforward to do. InfluxDB’s Holt-Winters support is pretty great though, and easy to use. Learn more about it here.

My thanks to InfluxData for their support of Monitoring Weekly.

Three Pillars, Zero Answers: We Need to Rethink Observability

Could it be that the industry’s fascination with the “three pillars of observability” is incorrect, misguided, or at least incomplete? Ben Sigelman, co-founder of LightStep, makes a damn good argument that we’re focused on the wrong thing.

Google’s Site Reliability Workbook now available for free in HTML format

Observability?! – Where do we go from here?

This article hits on a wide range of emerging trends and challenges in observability.

I’m John Allspaw, Ask Me Anything about incident analysis and postmortems

For those that know of John Allspaw, you’re probably already clicking this link hard. For those that don’t know him and his work, he’s an expert in incident analysis, post-mortems, and human factors/systems safety. Also, you should be clicking this link hard and gorging yourself on the incredibly helpful stuff in here.

RUM vs. APM: How They’re Similar and Different

This is an interesting take on things: RUM is more of a technique or way of monitoring a specific thing, whereas APM is a much broader category that encompasses RUM.

What is a Good Metric?

Not all metrics are infrastructure-related or deep in the code–many of the most important ones are higher-level. This article talks about what makes a good (business-level) metric.

Linux Kernel Observability through eBPF

There are two kinds of people in the world: those who love eBPF and those who haven’t used it yet.

How Much Should My Observability Stack Cost?

I’ll spoil it for you: you’re spending more than you think and that’s probably still not enough.

This issue is sponsored by:

GitPrime: 20 Patterns to Watch for in Engineering Teams

GitPrime’s new book draws together some of the most common software team dynamics, observed in working with hundreds of enterprise engineering organizations. Actionable insights to help you debug your development process with data. Get Your Copy.

Measuring Wikipedia page load times

Frontend monitoring doesn’t get enough love, in my opinion, so be sure to read this article and enjoy it–it’s quite useful.

How Dashboards are Changing Human Behavior in DevOps

Dashboards get a lot of flak these days, but I think it’s telling that the people throwing the shade at the concept of “I have dashboards to tell me things” are also those working in very advanced, technically-mature, small environments. The truth is that dashboards are an incredibly valuable asset, and as this article points out, helped IBM to start tearing down silos. Dashboards are great, y’all.

How We Built an Automated Anomaly Detection System onto a Streaming Pipeline

A look under the hood of some interesting Salesforce engineering.

Scaling up reporting on high-cardinality metrics

For those of you working on high-volume backend systems, you’ll like this article from the folks at Segment.

Operable Software

I love a good monster post and this one certainly hits home. I especially love the focus on mental models.

Six Simple Steps to Service Level Objectives (SLOs)

“Marie Cosgrove-Davies covers a user-focused approach to SLOs and some common pitfalls that teams encounter when they’re first trying to adopt SLO methods.”

Performance monitoring with OpenTracing, OpenCensus, and OpenMetrics

If you were starting to get confused by the silly ‘OpenWhatever’ naming patterns, this article from the folks at Datadog does a great job of explaining what the hell is going on.

Pro Tips: How Booking.com Handles Millions of Metrics Per Second with Graphite

From the article: “Over the years, Booking.com’s Graphite grew to consist of hundreds of servers.” … “It ingests more than 10 million unique points per second.” Holy crap.

This issue is sponsored by:

Raygun: On-call can seriously suck - but it doesn’t have to! Learn how to set up better on-call scheduling, improve your alerting strategy, and more with this ebook from Raygun. Grab a copy of it right here.

See you next week!

– Mike (@mike_julian) Monitoring Weekly Editor