Issue 066

Announcing my new video course: Monitor Anything

How do you improve monitoring, specifically? Where do you even start? Worse: how do you know you’re done? If this resonates, I’ve got something in the works you’re going to love: a foolproof framework for how to monitor any app, service, or infrastructure. Read more about it and pre-order the course here.

Articles & News

Effective Management of High Volume Numeric Data with Histograms | DataEngConf SF ‘18 (video)

Histograms are awesome. You should use them. This talk goes into more detail about how, why, and their different types.

Evolution of Telemetry at Bloomberg (video)

I’m always a big fan of these types of talks and hearing how a team/company goes through a radical evolution of something. This one is the folks at Bloomberg, who overhauled and consolidated all of their metrics tools into a single company-wide platform. Definitely worth a watch.

Be A Grafana/Graphite Power User

This is a list of really awesome tips about improving how you use Grafana. Seriously, it’s a great list.

OpenAPM: Explore and Design APM Solutions Based on Open Source Software

A neat visualization of different open-source APM tools and how they fit into a complete solution.

Monitoring Java Spring Boot applications with Prometheus: Part 1, Part 2

Got Spring Boot applications lying around? Considering/currently using Prometheus? This two-parter walks through two scenarios: instrumenting the code directly and monitoring black boxes (for when you can’t change the code).

Scaling a monitoring platform

Monitoring the awful horribleness that is the banking industry has always fascinated me (I’m a masochist, clearly), so this post from the folks at Plaid got my attention. They take us through how they chose the components for the next iteration of their monitoring platform and how it all fits together to monitor 9600+ banks.

Metal God ⚔️🎸 on Twitter: “ACHIEVEMENT UNLOCKED: I just learned how to get Splunk to automatically order a pizza when an alert fires. I have a whole new respect for this tool.”

Left without comment…maybe because I’m now trying to figure out how to make this a thing in other tools. I asked around my Splunk friends and this integration is apparently a few years old and built by Domino’s themselves. So that’s cool.

Sri Harsha Kalavala on Twitter: “If you haven’t implemented alerts on support page views yet, do it now!! …

This idea made the rounds on Twitter this week: set up an alert on increased page views of your support site or status page as an early warning mechanism that your customers are experiencing something wrong. I love the idea, but be careful about the noise the alert may generate. Also: not a bad way to create a DoS on someone’s support team: just hit their status page with an automated curl script. :/
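One way to keep the noise down is to alert only when page views are well above a recent baseline and also above an absolute floor. Here is a minimal sketch of that idea in Python. The function name, the input shape, and the default thresholds are all hypothetical, not something from the original tweet, and you would want to tune them against your own traffic.

```python
from statistics import mean, stdev


def page_view_spike(history, current, min_sigma=3.0, min_views=50):
    """Return True when `current` looks unusually high compared to `history`.

    history:   page-view counts for recent comparable intervals (assumed input)
    current:   page-view count for the interval being evaluated
    min_sigma: how many standard deviations above the mean counts as a spike
    min_views: absolute floor, so quiet pages don't alert on tiny bumps
    """
    # Not enough data for a baseline, or traffic too low to care about.
    if len(history) < 2 or current < min_views:
        return False

    baseline = mean(history)
    spread = stdev(history)

    # Use a spread of at least 1.0 so a perfectly flat history
    # doesn't turn every small increase into an alert.
    threshold = baseline + min_sigma * max(spread, 1.0)
    return current > threshold


if __name__ == "__main__":
    recent = [20, 22, 19, 21, 20, 23, 18]  # made-up counts per 5 minutes
    print(page_view_spike(recent, 120))
    print(page_view_spike(recent, 24))
```

The absolute floor also blunts the curl-script problem a little, though a determined script will still get past it, so this is better treated as an early-warning signal than a page-someone alert.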

Google Cloud Platform Blog: Understanding error budget overspend - part one - CRE life lessons

There’s one paragraph in this that’s super important and relevant for you folks (the one about metrics): error budgets are tied to customer-impacting errors, which means the metric(s) you use to determine errors must be an accurate portrayal of customer impact. But sometimes they aren’t, and you’ve got the wrong metric(s). Getting this right is harder than it sounds when you’re running complex applications, and even more so when you’re trying to come up with SLOs and error budgets for internal systems. Even trickier is understanding what level of errors equals actual customer impact: if you drop one request, do customers notice? What if it’s ten? A thousand? Do you have data to back that up? Moral of the story: arbitrary SLOs are bad, m’kay.
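To make the arithmetic concrete, here is a rough sketch of how an error budget falls out of an SLO. The 99.9% target and the request counts are made-up numbers for illustration, not figures from the Google post, and the function name is hypothetical. It also assumes that "failed request" is a faithful proxy for customer impact, which is exactly the assumption the post warns you to check.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Summarize error budget usage for one SLO window.

    slo_target:      fraction of requests that must succeed, e.g. 0.999
    total_requests:  requests served in the window
    failed_requests: requests counted as customer-impacting failures
    """
    # The budget is the number of failures the SLO tolerates.
    allowed = (1.0 - slo_target) * total_requests
    remaining = allowed - failed_requests

    # A 100% SLO leaves no budget at all, so any failure overspends it.
    if allowed > 0:
        consumed = failed_requests / allowed
    else:
        consumed = float("inf") if failed_requests else 0.0

    return {
        "allowed_failures": allowed,
        "remaining_failures": remaining,
        "consumed_fraction": consumed,
        "overspent": remaining < 0,
    }


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99.9% SLO.
    report = error_budget(0.999, 1_000_000, 1_250)
    for key, value in report.items():
        print(key, value)
```

With those made-up inputs the budget is roughly 1,000 failed requests, so 1,250 failures means the budget is overspent. Whether customers actually noticed any of them is the part the math can’t tell you.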

Can We Ever Escape From Data Overload? A Cognitive Systems Diagnosis

If you’re into reading academic papers, here’s one I found via John Allspaw this weekend: a diagnosis of why “too much data” is such a difficult problem to solve and why it seems to just be getting worse.

Tools

Grafana v5.2 Released

There are a couple of really cool bits in this release: alerting for Elasticsearch datasources and native Grafana builds for ARM. That second one is something I’ve been waiting on–I run Grafana on ARM devices all the time, so I love that there’s now a native build for it.

google/flogger: A Fluent Logging API for Java

If you’re looking for ways to improve your logging with Java applications, this logging library from Google might do the trick. They explain more about it in the readme, but the gist is that they’ve combined all of their Java logging libraries into one standard that aims to solve most, if not all, of their own pain points with logging in Java.

Events

Icinga Users Group - July 19, 2018 - Ludwigsburg, Germany

If you’re in the Stuttgart, Germany area, there’s an Icinga meetup coming up soon.

Sensu Summit 2018 - August 22-23, 2018 - Portland, OR USA

Sensu has graciously offered a discount code for all Monitoring Weekly readers! Use MonitoringWeekly at checkout for $50 off the early bird ticket.

See you next week!

– Mike (@mike_julian) Monitoring Weekly Editor