The Six Ws of Observability
At work, we discussed observability at a backend meetup. One lesson from the talks was that "events should contain rich context to reproduce". This got me thinking about what counts as sufficiently rich.
Tongue in cheek, I quickly modeled a (non-exhaustive) list after the Five Ws, a checklist from journalism that's used to ensure that a lede contains the gist of a story:
- Who initiates the event?
- Examples: User ID or other user-correlated surrogate key(s), traceparent (parent ID), authorization context
- What is the outcome of the event?
- Examples: HTTP status, return value (e.g. response body), error(s), changes in state (if any)
- When does the event occur?
- Examples: A timestamp (with date and time zone), duration of the event
- Where does the event occur?
- Examples: Correlation IDs (e.g. traceparent, X-Ray ID, custom IDs), thread and/or process ID, region/AZ name, service/node/pod/etc. name, source code location, IP address, geolocation
- Why does the event happen?
- Examples: An event ID (my-user-service/create-new-user) or a message ("Creating new user")
- Which circumstances affect the event?
- Examples: Inputs/variants used (omitting input that does not meaningfully change the outcome), any stochastic inputs, feature flags (not) in use
- One should also include saturation signals here, such as host metrics (e.g. CPU, memory, IO saturation), GC statistics, DB metrics, and platform (e.g. AWS, GCP) service metrics. However, this type of data can be harder to carry as part of an event.
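To make the checklist a bit more concrete, here is a minimal Python sketch of one way a single event could be grouped by the six Ws and checked for gaps. The key names, the build_event and missing_ws helpers, and all sample values are my own assumptions for illustration, not an established schema or any library's API.

```python
import json
import os
import threading
from datetime import datetime, timezone

# Hypothetical top-level keys, one per W in the checklist above.
SIX_WS = ("who", "what", "when", "where", "why", "which")


def build_event(who, what, where, why, which, started_at, duration_ms):
    """Assemble one event as a plain dict, grouped by the six Ws."""
    return {
        "who": who,
        "what": what,
        "when": {
            # A timezone-aware datetime keeps the UTC offset in the ISO string.
            "timestamp": started_at.isoformat(),
            "duration_ms": duration_ms,
        },
        # Process and thread identity are cheap to attach automatically.
        "where": {
            **where,
            "pid": os.getpid(),
            "thread": threading.current_thread().name,
        },
        "why": why,
        "which": which,
    }


def missing_ws(event):
    """Return the Ws that are absent or empty, in checklist order."""
    return [w for w in SIX_WS if not event.get(w)]


# Illustrative values only; none of these come from a real system.
event = build_event(
    who={"user_id": "u-123"},
    what={"http_status": 201, "errors": []},
    where={"service": "my-user-service", "region": "eu-north-1"},
    why={"event_id": "my-user-service/create-new-user",
         "message": "Creating new user"},
    which={"feature_flags": {"new-signup-flow": True}},
    started_at=datetime.now(timezone.utc),
    duration_ms=12.5,
)

print(json.dumps(event, indent=2))
print("missing:", missing_ws(event))
```

A check like missing_ws could run in a test or a log pipeline to flag events that skip one of the Ws. Saturation signals are left out on purpose here, since they are usually collected separately and joined later.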
I wonder if a checklist like this could help produce sufficiently rich events?
P.S. Thanks to Antti Poutiainen, our resident SRE and DNS chaos artist at SOK, for some of the additions to this list.