Distributed Tracing

An Observability technique that tracks a request as it traverses multiple services in a distributed system, recording the timing and metadata of each operation as spans within a unified trace, enabling end-to-end visibility into request flows.

Problem

In a microservices architecture, a single user request may touch ten or more services. When a request is slow or fails:

  • Logs from each service are separate; correlating them manually is tedious.
  • Metrics show that latency increased but not where in the chain.
  • It is impossible to visualize the causal chain of calls without purpose-built tooling.

Solution / Explanation

Distributed tracing assigns a unique Trace ID to each request at its origin. Every service that handles the request creates a Span — a named, timed operation — and attaches the Trace ID to it. Spans are sent to a tracing backend that assembles them into a tree showing the full request lifecycle.
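A minimal sketch of this flow using the OpenTelemetry Python API (service and operation names are illustrative; assumes the opentelemetry-api and opentelemetry-sdk packages). Nested with-blocks produce parent/child spans sharing one Trace ID, and the exporter ships each finished span to a backend (here, stdout):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a provider that prints each finished span.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

# Entering a with-block starts a span; leaving it records the end time.
# The inner span automatically becomes a child of the outer one.
with tracer.start_as_current_span("POST /checkout"):
    with tracer.start_as_current_span("CreateOrder"):
        pass  # business logic would run here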

Core Concepts

Trace: The entire journey of a request through the system. Represented as a tree of spans. Has a globally unique Trace ID.

Span: A single unit of work within the trace (e.g., “handle HTTP request”, “query database”, “call payment service”). Each span records the following (see the sketch after this list):

  • Operation name
  • Service name
  • Start time and duration
  • Status (OK / error)
  • Key-value attributes
  • Parent Span ID (to build the tree)
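A sketch of setting those fields on a span (the attribute keys follow OpenTelemetry's semantic conventions; the statement text is illustrative). The operation name is given at creation; start time, duration, and the parent Span ID are recorded by the SDK automatically:

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("order-service")

with tracer.start_as_current_span("query database") as span:
    # Key-value attributes describing the operation.
    span.set_attribute("db.system", "postgresql")
    span.set_attribute("db.statement", "INSERT INTO orders VALUES (...)")
    try:
        # ... execute the query here ...
        span.set_status(Status(StatusCode.OK))
    except Exception as exc:
        span.record_exception(exc)
        span.set_status(Status(StatusCode.ERROR))
        raise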

Context Propagation: The mechanism for passing the Trace ID (and Span ID) across service boundaries. The W3C Trace Context standard defines the traceparent and tracestate HTTP headers for this purpose.
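A traceparent header carries four dash-separated fields: version, trace ID, parent span ID, and trace flags, for example:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

A minimal propagation sketch with the OpenTelemetry Python API, assuming the default W3C Trace Context propagator (the headers dict stands in for a real HTTP request):

from opentelemetry import propagate

# Client side: copy the current trace context into outgoing headers.
headers = {}
propagate.inject(headers)  # writes the traceparent (and tracestate) keys

# Server side: rebuild the context from incoming headers so that new
# spans become children of the caller's span.
ctx = propagate.extract(headers)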

Baggage: Key-value pairs that travel alongside the trace context across all services (e.g., tenant ID, user ID). Useful for correlating traces with business context.
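A minimal sketch using the OpenTelemetry baggage API (the tenant.id key and value are illustrative). With the default propagators, attached baggage travels to downstream services in the W3C baggage header:

from opentelemetry import baggage, context

# set_baggage returns a new Context rather than mutating the current one.
ctx = baggage.set_baggage("tenant.id", "acme-corp")
token = context.attach(ctx)  # make it the current context
try:
    # Downstream code (and downstream services) can read the value back.
    tenant = baggage.get_baggage("tenant.id")
finally:
    context.detach(token)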

Example Trace Tree

Trace: a3b2c1d4
└── [0ms - 250ms] API Gateway: POST /checkout
    ├── [5ms - 80ms]  Order Service: CreateOrder
    │   └── [10ms - 40ms] DB: INSERT orders
    ├── [85ms - 180ms] Inventory Service: ReserveItems
    └── [185ms - 245ms] Payment Service: ProcessPayment
        └── [190ms - 240ms] External: Stripe API

OpenTelemetry

The de facto standard for distributed tracing instrumentation, formed by merging OpenTracing and OpenCensus into a single CNCF project. Provides (see the setup sketch after this list):

  • API — vendor-neutral interfaces for creating spans.
  • SDK — implementation with samplers, exporters, processors.
  • OTLP — wire protocol for exporting telemetry to backends.
  • Instrumentation libraries — automatic tracing for popular frameworks (HTTP, gRPC, DB drivers).
  • Collector — a standalone agent/gateway for receiving, processing, and exporting telemetry.
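A minimal setup sketch wiring these pieces together: the SDK batches finished spans and exports them over OTLP/gRPC to a Collector on its conventional port 4317 (endpoint and service name are illustrative; assumes the opentelemetry-exporter-otlp-proto-grpc package):

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Resource attributes identify this service on every span it emits.
resource = Resource.create({"service.name": "checkout"})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)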

Common backends: Jaeger (CNCF-graduated, OpenTracing-compatible), Zipkin, Grafana Tempo, AWS X-Ray, .NET Aspire Dashboard.

When to Use

  • Any microservices or distributed system where latency or failures are hard to trace manually.
  • Systems with SLO requirements where bottleneck identification matters.
  • Post-incident analysis of cross-service failures.

Trade-offs

Benefit                                         | Drawback
------------------------------------------------|----------------------------------------------------------
End-to-end visibility across services           | Requires instrumentation investment
Pinpoints latency bottlenecks precisely         | High cardinality (many unique trace IDs) → storage cost
Correlates logs, metrics, and traces            | Context propagation must be implemented in every service
Standard (OpenTelemetry) reduces vendor lock-in | Sampling decisions affect trace completeness
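On the sampling trade-off: a sketch of head-based sampling with the OpenTelemetry Python SDK, keeping roughly 10% of traces while honoring the parent's decision so no trace is recorded only in part (the ratio is illustrative):

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Root spans are sampled at 10%; child spans inherit the root's decision,
# so each trace is either captured end-to-end or dropped entirely.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.1)))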