
Week 40 note: On alerts that I like

Last week I ran into a lot of alerts and experienced alert fatigue. It got me thinking about what sorts of alerts I actually like to see.

Service-Level Objectives

An alert based on a Service-Level Objective (SLO) is my favorite type of alert.

A Service-Level Objective measures your business. It's actionable: when an SLO alert fires, you can be certain your service is in real trouble. There is no question whether the service should or shouldn't be fixed. It's experiencing problems, go fix it! (Yes, sometimes the alert itself is broken, but that's a different story. Always actionable, it is.)
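
In essence, an SLO alert boils down to a check like this. A minimal sketch in Python, where the metric names and the 99.9% objective are placeholders, not anyone's real setup:

```python
# Minimal sketch of an SLO check: compare the observed success ratio
# over a window against the objective. Names and numbers are placeholders.

def slo_breached(successes: int, total: int, objective: float = 0.999) -> bool:
    """True if the success ratio over the window fell below the objective."""
    if total == 0:
        return False  # no traffic, nothing to judge (more on this below)
    return successes / total < objective

# 10 failures out of 50,000 requests -> 99.98% success, the SLO holds.
print(slo_breached(successes=49_990, total=50_000))  # False
```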

SLOs can produce false positives. Your services might not get invoked much during the night, or you might be serving a low-traffic region such as sa-east-1 (from your perspective). On these low-traffic instances, a handful of errors can breach the objective, triggering the SLO and causing unnecessary toil for on-callers.
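
The low-traffic problem is easy to see with made-up numbers:

```python
# Overnight, the same service might see only 200 requests per hour.
# Two transient errors already blow a 99.9% objective:
errors, total = 2, 200
print(1 - errors / total)  # 0.99, well below a 99.9% success objective
```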

SLOs can also produce false negatives. This happens when you don't measure your business/service end-to-end (or at least to a sufficient degree). For example, on a previous team we had to create an SLO for our nginx ingress because our application SLOs showed everything to be fine while the business was suffering.

SLOs typically measure a constant objective over a period of time, e.g. 99.9% over one month. However, some have argued for different objectives depending on the time of day, say 99.9% during business hours and 90% during the night. I find this an alluring thought: maybe nighttime isn't that valuable for a particular business, and relaxing the objective would cut the cost of nightly on-call. That said, I'm not sure how mature the tooling is for measuring SLOs this way, and maybe it's just not a good idea. Either way, the world doesn't seem ready for it yet.
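
If you wanted to try it anyway, the core idea is simple enough to sketch. The hours and objectives below are assumptions on my part:

```python
from datetime import datetime

# Hypothetical time-varying objective: strict during business hours,
# relaxed at night. The hours and numbers are assumptions.
def current_objective(now: datetime) -> float:
    if 8 <= now.hour < 18:  # business hours (assumed)
        return 0.999
    return 0.90             # night: looser objective, cheaper on-call

print(current_objective(datetime(2023, 10, 2, 14)))  # 0.999
print(current_objective(datetime(2023, 10, 2, 3)))   # 0.9
```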

See also: Alerting for SLOs

Threshold-based alerts

These are alerts that indicate something in my infrastructure is about to blow up very soon. Maybe disk usage has reached 95% and I should do something about it. Or maybe my SMS/email quota is approaching its cap and I'll soon be unable to send any messages to my customers.

These are potential SEV-1 scenarios where you want to be proactive. The service is about to go down, in a big way. Better act!
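
The check itself is about as simple as alerts get. A sketch, with an illustrative threshold:

```python
# Sketch of a threshold check: fire while there is still time to act.
# The 95% threshold is illustrative.
def disk_alert(used_bytes: int, capacity_bytes: int, threshold: float = 0.95) -> bool:
    return used_bytes / capacity_bytes >= threshold

# 480 GB used out of 500 GB -> 96% utilization, time to act.
print(disk_alert(480 * 10**9, 500 * 10**9))  # True
```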

Rate-based alerts

These are alerts that warn you of higher-than-usual API usage or other unusual activity. They can be useful for, say, spotting somebody scanning through your userbase.

Rate-based alerts can produce false positives, for example during a campaign when traffic is unusual. They can get noisy, so during a high-activity period I only care about the first one.
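
One rough way to sketch the idea is to compare the current rate against a trailing baseline. The window size and spike factor here are assumptions:

```python
from collections import deque

# Sketch of a rate-based alert: flag when the current rate exceeds
# a multiple of the trailing average. Window and factor are illustrative.
class RateAlert:
    def __init__(self, window: int = 60, factor: float = 3.0):
        self.samples: deque[float] = deque(maxlen=window)
        self.factor = factor

    def observe(self, requests_per_minute: float) -> bool:
        baseline = sum(self.samples) / len(self.samples) if self.samples else None
        self.samples.append(requests_per_minute)
        # Alert once the rate jumps well above the recent baseline.
        return baseline is not None and requests_per_minute > self.factor * baseline

alert = RateAlert(window=5)
for rate in [100, 110, 95, 105, 100, 900]:  # sudden spike at the end
    if alert.observe(rate):
        print(f"unusual activity: {rate} req/min")
```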

Security alerts

For example, Snyk alerts integrated via Slack tell me what the vulnerability is and allow me to triage effectively. But there are also security alerts based on, e.g., statistical outlier activity in the logs of your services (either your application or AWS services).
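
A toy sketch of the log-based kind, flagging an hourly event count that sits far outside its history. The three-sigma cutoff is my assumption, not any particular product's logic:

```python
import statistics

# Toy outlier check over hourly counts of some security-relevant log event
# (failed logins, denied API calls, ...). The 3-sigma cutoff is illustrative.
def is_outlier(history: list[int], current: int, sigmas: float = 3.0) -> bool:
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    return stdev > 0 and abs(current - mean) > sigmas * stdev

failed_logins = [12, 9, 15, 11, 10, 13, 14, 12]
print(is_outlier(failed_logins, 80))  # True: likely worth a look
```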

Conclusion

There's nothing that beats a good SLO, but other alert types are useful as well.

What I love about all of these alert types is that they tend to indicate clear problems. Nothing is as frustrating as seeing a bunch of alerts in your Slack channel or inbox and not being able to tell which ones indicate real problems. To me, a single error is not an incident, but a 1% error rate very likely is (a minor one, but an incident regardless). Alerts should be tuned to reflect this.
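
That last point is really just the SLO idea again: alert on the ratio, not the count.

```python
# A single error in a million requests is noise; 1% of requests failing
# very likely is an incident. Alert on the ratio, not the count.
def is_incident(errors: int, total: int, ratio: float = 0.01) -> bool:
    return total > 0 and errors / total >= ratio

print(is_incident(1, 1_000_000))    # False: a single error is not an incident
print(is_incident(1_000, 100_000))  # True: 1% error rate
```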

#sre