The Three Pillars of Observability Explained
Tags: observability, monitoring, distributed-systems


Published on March 8, 2026 · by PerfBlog Team

Beyond Monitoring: Understanding Observability

Monitoring tells you when something is wrong. Observability tells you why.

In a world of microservices and distributed systems, you need the ability to ask arbitrary questions about your system’s behavior without deploying new code. That’s observability.

Pillar 1: Logs

Logs are timestamped, discrete events that record what happened in your system.

Structured Logging Best Practices

{
  "timestamp": "2026-03-08T14:30:00Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "user_id": "user_42",
  "message": "Payment processing failed",
  "error": "insufficient_funds",
  "amount": 99.99,
  "currency": "EUR",
  "duration_ms": 234
}

Key Practices

  • Always use structured logging (JSON) over plain text
  • Include trace IDs for correlation across services
  • Use appropriate log levels (DEBUG, INFO, WARN, ERROR)
  • Add business context (user ID, order ID, amounts)
  • Centralize logs with tools like ELK Stack or Grafana Loki
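A log record shaped like the JSON example above can be produced with the Python standard library alone. The sketch below is illustrative, not a recommendation over a dedicated library: the field names and the `context` attribute are our own convention, not part of the `logging` module.

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    converter = time.gmtime  # UTC, so the trailing "Z" is honest

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "payment-service",
            "message": record.getMessage(),
        }
        # Merge structured context attached via logging's `extra` argument.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed",
             extra={"context": {"trace_id": "abc123def456",
                                "error": "insufficient_funds"}})
```

Because every field is a key rather than free text, a log backend can index `trace_id` or `error` directly instead of regex-matching message strings.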

Pillar 2: Metrics

Metrics are numeric values measured over time, aggregated into time-series data.

The RED Method (for services)

  • Rate: Number of requests per second
  • Errors: Number of failed requests per second
  • Duration: Distribution of response times

The USE Method (for resources)

  • Utilization: Percentage of time the resource was busy
  • Saturation: Amount of work queued beyond what the resource can serve
  • Errors: Count of error events
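For intuition, the RED numbers can be derived from raw request records before any metrics system is involved. A stdlib-only sketch with made-up data (the record shape and window length are ours, purely for illustration):

```python
from statistics import quantiles

# Each record: (duration_ms, succeeded), observed over a 10-second window.
requests = [(120, True), (95, True), (480, False), (210, True), (88, True)]
window_s = 10

rate = len(requests) / window_s                         # Rate: requests/second
errors = sum(not ok for _, ok in requests) / window_s   # Errors: failures/second
durations = sorted(d for d, _ in requests)
# Duration is a distribution, not an average: report percentiles.
cuts = quantiles(durations, n=100)                      # 99 percentile cut points
p50, p95 = cuts[49], cuts[94]

print(f"rate={rate:.1f}/s errors={errors:.1f}/s p50={p50:.0f}ms p95={p95:.0f}ms")
```

Note that averaging durations would hide the 480 ms outlier entirely; percentiles are what make the slow tail visible.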

Example: Prometheus Metrics

from prometheus_client import Counter, Histogram

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5]
)
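Recording into these collectors is a one-liner per signal. The sketch below uses the real `prometheus_client` API (`labels().inc()`, `labels().time()`), but puts the collectors on their own `CollectorRegistry` so it can run standalone without clashing with metrics registered elsewhere; the handler itself is a stub.

```python
from prometheus_client import Counter, Histogram, CollectorRegistry

registry = CollectorRegistry()
REQUEST_COUNT = Counter(
    'http_requests_total', 'Total HTTP requests',
    ['method', 'endpoint', 'status'], registry=registry)
REQUEST_DURATION = Histogram(
    'http_request_duration_seconds', 'HTTP request duration in seconds',
    ['method', 'endpoint'], registry=registry)

def handle_request():
    # time() observes the elapsed wall time into the histogram on exit.
    with REQUEST_DURATION.labels('GET', '/orders').time():
        pass  # real handler work goes here
    REQUEST_COUNT.labels('GET', '/orders', '200').inc()

handle_request()
```

In a real service these collectors would live on the default registry and be scraped via `prometheus_client.start_http_server()`.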

Pillar 3: Traces

Distributed traces follow a request’s journey through multiple services.

Anatomy of a Trace

A trace consists of spans - named, timed operations representing a unit of work:

[Trace: order-checkout]
└── [Span: API Gateway] 160ms
    ├── [Span: Auth Service] 5ms
    └── [Span: Order Service] 150ms
        ├── [Span: Inventory Check] 30ms
        ├── [Span: Payment Service] 80ms
        │   └── [Span: External Payment API] 60ms
        └── [Span: Notification Service] 20ms

OpenTelemetry Implementation

const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('order-service');

async function processOrder(orderId) {
  return tracer.startActiveSpan('process-order', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      await checkInventory(orderId);
      await processPayment(orderId);
      await sendNotification(orderId);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end(); // always end the span, even when a step fails
    }
  });
}

Connecting the Three Pillars

The real power comes from correlating all three signals:

  1. A metric alert fires: p99 latency > 500ms
  2. Traces reveal the slow span: payment-service database query
  3. Logs from that span show: “Connection pool exhausted, waiting for available connection”

This correlation requires a shared trace ID that flows through all three signals.
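The three-step workflow above can be mimicked in a few lines: given spans and logs that share a trace ID, the slowest span leads straight to its log lines. The data and field names below are invented for illustration; a real backend would index its stores the same way.

```python
# Toy signal stores, keyed the way an observability backend would index them.
spans = [
    {"trace_id": "t1", "name": "payment-db-query", "duration_ms": 40},
    {"trace_id": "t2", "name": "payment-db-query", "duration_ms": 620},
]
logs = [
    {"trace_id": "t1", "message": "query ok"},
    {"trace_id": "t2",
     "message": "Connection pool exhausted, waiting for available connection"},
]

# Step 2: the latency alert points us at the slowest span.
slow = max(spans, key=lambda s: s["duration_ms"])

# Step 3: the shared trace_id pulls up exactly the relevant log lines.
related = [l["message"] for l in logs if l["trace_id"] == slow["trace_id"]]
print(slow["trace_id"], related)
```

Without the shared `trace_id`, step 3 degenerates into grepping timestamps across services, which is exactly the pain observability is meant to remove.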

Building Your Observability Stack

Component     Open Source               Commercial
Metrics       Prometheus + Grafana      Datadog, New Relic
Logs          Grafana Loki, ELK Stack   Splunk, Datadog
Traces        Jaeger, Zipkin            Datadog, Honeycomb
All-in-one    Grafana Stack             Datadog, Dynatrace

Conclusion

Observability is not a product you buy - it’s a property of your system that you build. Start by instrumenting your most critical services with all three pillars, and gradually expand coverage.