W3C Trace Context and traceparent in Clojure

11 Nov, 2021

Let's say that we're using some Clojure logging library and we've configured our log events to be outputted in structured format using JSON.

(log/info) gets you:

{
  "hostname": "Alkio.local",
  "level": "INFO",
  "src": "structured-logging:2",
  "thread": "nRepl-session-c51e65db-947f-4e07-bc3e-d64538c84fae",
  "ts": "2021-11-09T17:31:36.667+0000"
}

What we'd like to do next is to be able to group together all of the log events in our different services to help us identify which requests belong together in the log stream(s). While we could use the thread field to give us all the logs related to that one thread, it won't solve the problem in multi-threaded scenarios nor will it grant us the ability to find the related logs across all of the services.

For that, we need to add trace context to our log events.

There are two recommended standards for passing a trace context: W3C Trace Context or B3. This post has been written with W3C Trace Context in mind. In W3C Trace Context, the main identity fields Trace ID, Span ID, and Parent ID are packaged under a HTTP header called traceparent which should be sent from service to service. These fields won't be enough for us to form a proper distributed trace (so, no waterfall style trace visualization), but they'll be enough to group the logs together and to pass the trace context around.

The traceparent header has the following ABNF definition:

HEXDIGLC = DIGIT / "a" / "b" / "c" / "d" / "e" / "f"
value           = version "-" version-format
version         = 2HEXDIGLC
version-format   = trace-id "-" parent-id "-" trace-flags
trace-id         = 32HEXDIGLC
parent-id        = 16HEXDIGLC
trace-flags      = 2HEXDIGLC

Example: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-00

The header value consists of four components (separated by dashes):

version: Version of the W3C Trace Context standard. "00" is the only version supported by the standard currently.
trace-id: The ID of the entire distributed trace ("0af7651916cd43dd8448eb211c80319c"). SHOULD be unique on a system level, eg:

(defn gen-trace-id
  []
  (str/replace (str (java.util.UUID/randomUUID)) #"-" ""))

parent-id: Span ID from previous span ("b7ad6b7169203331"). It can also be null ("0000000000000000") when trace context is generated for the first time, eg:

(defn gen-span-id
  []
  (format "%016x" (.nextLong (ThreadLocalRandom/current))))

trace-flags: An 8-bit flag ("00"). Currently just "00" (non-sampled) and "01" (sampled) supported. Log sampling is a topic of its own and useful when you are experiencing logging hotspots. In those cases, you will want to look into sampling techniques.

To implement trace context, we could add the fields explicitly by passing the trace context inside a context map. Alternatively, we could choose an implicit approach where we pass the trace context within a binding context using dynamic vars.

The implicit approach may seem easier to tag on to an existing setup, but there are many caveats. For example, you'll need to always make sure that your lazy sequences are always realized within the binding context with doall or mapv style functions (or risk losing the trace context from your logs). You also need to build helper functions to support binding context inheritance for Java's Threads and ThreadPools and likely for core.async too.

Here's how a middleware might look like using the implicit approach:

(defn get-trace-context
  [{:keys [headers]}]
  (or (some-> (get headers "traceparent")
        (parse-traceparent)
        (gen-trace-context))
      (gen-trace-context)))

(defn wrap-trace-context
  ([handler]
   (fn
     ([request]
      (let [{:keys [version trace-id parent-id span-id trace-flags]} (get-trace-context request)]
        (with-bindings* {#'*version* version
                         #'*trace-id* trace-id
                         #'*parent-id* parent-id
                         #'*span-id* span-id
                         #'*trace-flags* trace-flags}
          handler request)))
     ([request respond raise]
      (let [{:keys [version trace-id parent-id span-id trace-flags]} (get-trace-context request)]
        (with-bindings* {#'*version* version
                         #'*trace-id* trace-id
                         #'*parent-id* parent-id
                         #'*span-id* span-id
                         #'*trace-flags* trace-flags}
          handler request respond raise))))))

In get-trace-context, we try to create a trace context from traceparent header. If we're unable to create one, we'll generate trace context ourselves - ideally, this should happen for root spans only. Realistically, you're likely to see root span trace contexts out in the wild until you're fully passing the trace context within your system.

(Here I've left the definitions of parse-traceparent and gen-trace-context as an exercise for the reader.)

You'll still need to configure your logging library to amend the trace context fields to the JSON map, but once you've done that, your log event should look like this:

{
  "hostname": "Alkio.local",
  "level": "INFO",
  "parent-id": "0000000000000000",
  "span-id": "e558fd2d502f764c",
  "src": "structured-logging:4",
  "thread": "nRepl-session-c51e65db-947f-4e07-bc3e-d64538c84fae",
  "trace-id": "37b325e4cc3f40aabd422f3b25761dec",
  "ts": "2021-11-09T17:31:36.674+0000"
}

trace-id is the most important field for grouping log events together. span-id is useful in cases where you need to isolate the activities of one service during analysis. Having parent-id there is useful to see when your trace context is not being correctly generated and/or passed — it can serve as a reminder to implement traceparent passing where it is not being passed.

Don't forget to pass the traceparent header and its value whenever you are making requests to internal upstream services.

Conclusion

Using trace context for structured logs is a good first step toward observability. If you are having difficulty grouping even a single HTTP request's log events together then this may be helpful. And if you're still logging multi-line logs, you could also think about migrating to structured logging format.

Using the binding context (the implicit approach), you have to be careful to see that you aren't losing your trace context from your logs. Losing it is not critical, but something to be aware of at all times.

Once you've implemented support for Trace IDs throughout your system, you can start to forward the Trace ID to the frontend all the way to the error page for the end customer to copy & paste. This will help reduce the amount of toil spent hunting the offending bit(s) from the logs.

In the long run, you should think about adopting an observability service (managed or self-hosted) and start using proper distributed traces. I recommend reading this book about observability engineering for ideas about further adopting observability.

#clojure #observability #programming