The Three Pillars of Observability Explained
Beyond Monitoring: Understanding Observability
Monitoring tells you when something is wrong. Observability tells you why.
In a world of microservices and distributed systems, you need the ability to ask arbitrary questions about your system’s behavior without deploying new code. That’s observability.
Pillar 1: Logs
Logs are timestamped, discrete events that record what happened in your system.
Structured Logging Best Practices
```json
{
  "timestamp": "2026-03-08T14:30:00Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "user_id": "user_42",
  "message": "Payment processing failed",
  "error": "insufficient_funds",
  "amount": 99.99,
  "currency": "EUR",
  "duration_ms": 234
}
```
Key Practices
- Always use structured logging (JSON) over plain text
- Include trace IDs for correlation across services
- Use appropriate log levels (DEBUG, INFO, WARN, ERROR)
- Add business context (user ID, order ID, amounts)
- Centralize logs with tools like ELK Stack or Grafana Loki
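The practices above can be sketched with the Python standard library alone. This is a minimal illustration, not a production logging setup; the `JsonFormatter` class and the `context` field passed via `extra` are names chosen for this example, and real services typically use a library such as `structlog` or `python-json-logger` instead.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "payment-service",
            "message": record.getMessage(),
        }
        # Merge structured business context passed via the `extra` argument.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment processing failed",
    extra={"context": {"trace_id": "abc123def456", "error": "insufficient_funds"}},
)
```

Because every record is a JSON object, a log aggregator can index `trace_id` and `error` as queryable fields instead of grepping free text.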
Pillar 2: Metrics
Metrics are numeric values measured over time, aggregated into time-series data.
The RED Method (for services)
- Rate: Number of requests per second
- Errors: Number of failed requests per second
- Duration: Distribution of response times
The USE Method (for resources)
- Utilization: Average time resource was busy
- Saturation: Amount of work queued
- Errors: Count of error events
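As a rough sketch of the USE method applied to the CPU, the signals can be approximated with the standard library (Unix only; `cpu_use_snapshot` is a name invented here, load average is only a coarse proxy for utilization, and real error counts need platform tooling):

```python
import os

def cpu_use_snapshot():
    """Approximate USE signals for the CPU (Unix-only stdlib sketch).

    Utilization: 1-minute load average normalized by core count.
    Saturation:  runnable work beyond available cores (floored at 0).
    Errors:      hardware error counts require platform tooling; stubbed to 0.
    """
    cores = os.cpu_count() or 1
    load_1m, _, _ = os.getloadavg()  # avg. runnable processes over 1 minute
    return {
        "utilization": load_1m / cores,
        "saturation": max(0.0, load_1m - cores),
        "errors": 0,  # placeholder: e.g. read EDAC counters on Linux
    }

snapshot = cpu_use_snapshot()
```

In practice these values come from a node-level exporter (such as Prometheus node_exporter) rather than hand-rolled code; the sketch only shows what each USE signal measures.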
Example: Prometheus Metrics
```python
from prometheus_client import Counter, Histogram

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5]
)
```
Pillar 3: Traces
Distributed traces follow a request’s journey through multiple services.
Anatomy of a Trace
A trace consists of spans: named, timed operations, each representing a unit of work:

```
[Trace: order-checkout]
└── [Span: API Gateway] 2ms
    ├── [Span: Auth Service] 5ms
    └── [Span: Order Service] 150ms
        ├── [Span: Inventory Check] 30ms
        ├── [Span: Payment Service] 80ms
        │   └── [Span: External Payment API] 60ms
        └── [Span: Notification Service] 20ms
```
OpenTelemetry Implementation
```javascript
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('order-service');

async function processOrder(orderId) {
  return tracer.startActiveSpan('process-order', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      await checkInventory(orderId);
      await processPayment(orderId);
      await sendNotification(orderId);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      // Record the failure on the span so the trace shows what went wrong.
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end(); // always end the span, even on failure
    }
  });
}
```
Connecting the Three Pillars
The real power comes from correlating all three signals:
- A metric alert fires: p99 latency > 500ms
- Traces reveal the slow span: payment-service database query
- Logs from that span show: “Connection pool exhausted, waiting for available connection”
This correlation requires a shared trace ID that flows through all three signals.
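One common way to achieve this is to stamp every log record with the active trace ID. A stdlib sketch using a `logging.Filter`: in a real system the ID would come from the tracing library (e.g. OpenTelemetry's `trace.get_current_span()`); here a `contextvars.ContextVar` stands in for it.

```python
import contextvars
import logging
import sys

# Stand-in for the tracing library's notion of the current trace.
current_trace_id = contextvars.ContextVar("trace_id", default="unknown")

class TraceIdFilter(logging.Filter):
    """Stamp every log record with the trace ID of the current request."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, just enrich it

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter(
    '{"level": "%(levelname)s", "trace_id": "%(trace_id)s", "message": "%(message)s"}'
))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

current_trace_id.set("abc123def456")
logger.error("Connection pool exhausted, waiting for available connection")
```

With the trace ID on every log line, jumping from a slow span in the trace viewer to the exact logs it produced becomes a single indexed query.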
Building Your Observability Stack
| Component | Open Source | Commercial |
|---|---|---|
| Metrics | Prometheus + Grafana | Datadog, New Relic |
| Logs | Grafana Loki, ELK Stack | Splunk, Datadog |
| Traces | Jaeger, Zipkin | Datadog, Honeycomb |
| All-in-one | Grafana Stack | Datadog, Dynatrace |
Conclusion
Observability is not a product you buy; it is a property of your system that you build. Start by instrumenting your most critical services with all three pillars, and gradually expand coverage.