The Three Pillars of Observability Explained
Beyond Monitoring: Understanding Observability
Monitoring tells you when something is wrong. Observability tells you why.
In a world of microservices and distributed systems, you need the ability to ask arbitrary questions about your system’s behavior without deploying new code. That’s observability.
Pillar 1: Logs
Logs are timestamped, discrete events that record what happened in your system.
Structured Logging Best Practices
```json
{
  "timestamp": "2026-03-08T14:30:00Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "user_id": "user_42",
  "message": "Payment processing failed",
  "error": "insufficient_funds",
  "amount": 99.99,
  "currency": "EUR",
  "duration_ms": 234
}
```
Key Practices
- Always use structured logging (JSON) over plain text
- Include trace IDs for correlation across services
- Use appropriate log levels (DEBUG, INFO, WARN, ERROR)
- Add business context (user ID, order ID, amounts)
- Centralize logs with tools like ELK Stack or Grafana Loki
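The practices above can be sketched with the Python standard library alone. This is a minimal illustration, not a production logging setup; the `JsonFormatter` class and the `context` field passed via `extra` are names chosen for this example, and real services typically use a library such as `structlog` or `python-json-logger` instead.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "payment-service",
            "message": record.getMessage(),
        }
        # Merge structured business context passed via the `extra` argument.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment processing failed",
    extra={"context": {"trace_id": "abc123def456", "error": "insufficient_funds"}},
)
```

Because every record is a JSON object, a log aggregator can index `trace_id` and `error` as queryable fields instead of grepping free text.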
Pillar 2: Metrics
Metrics are numeric values measured over time, aggregated into time-series data.
The RED Method (for services)
- Rate: Number of requests per second
- Errors: Number of failed requests per second
- Duration: Distribution of response times
The USE Method (for resources)
- Utilization: Average time resource was busy
- Saturation: Amount of work queued
- Errors: Count of error events
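As a rough sketch of the USE method applied to the CPU, the signals can be approximated with the standard library (Unix only; `cpu_use_snapshot` is a name invented here, load average is only a coarse proxy for utilization, and real error counts need platform tooling):

```python
import os

def cpu_use_snapshot():
    """Approximate USE signals for the CPU (Unix-only stdlib sketch).

    Utilization: 1-minute load average normalized by core count.
    Saturation:  runnable work beyond available cores (floored at 0).
    Errors:      hardware error counts require platform tooling; stubbed to 0.
    """
    cores = os.cpu_count() or 1
    load_1m, _, _ = os.getloadavg()  # avg. runnable processes over 1 minute
    return {
        "utilization": load_1m / cores,
        "saturation": max(0.0, load_1m - cores),
        "errors": 0,  # placeholder: e.g. read EDAC counters on Linux
    }

snapshot = cpu_use_snapshot()
```

In practice these values come from a node-level exporter (such as Prometheus node_exporter) rather than hand-rolled code; the sketch only shows what each USE signal measures.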
Example: Prometheus Metrics
```python
from prometheus_client import Counter, Histogram

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5]
)
```
Pillar 3: Traces
Distributed traces follow a request’s journey through multiple services.
Anatomy of a Trace
A trace consists of spans: named, timed operations, each representing a unit of work:

```
[Trace: order-checkout]
└── [Span: API Gateway] 2ms
    ├── [Span: Auth Service] 5ms
    └── [Span: Order Service] 150ms
        ├── [Span: Inventory Check] 30ms
        ├── [Span: Payment Service] 80ms
        │   └── [Span: External Payment API] 60ms
        └── [Span: Notification Service] 20ms
```
OpenTelemetry Implementation
```javascript
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('order-service');

async function processOrder(orderId) {
  return tracer.startActiveSpan('process-order', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      await checkInventory(orderId);
      await processPayment(orderId);
      await sendNotification(orderId);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      // Record the failure on the span so the trace shows what went wrong.
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end(); // always end the span, even on failure
    }
  });
}
```
Connecting the Three Pillars
The real power comes from correlating all three signals:
- A metric alert fires: p99 latency > 500ms
- Traces reveal the slow span: payment-service database query
- Logs from that span show: “Connection pool exhausted, waiting for available connection”
This correlation requires a shared trace ID that flows through all three signals.
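One common way to achieve this is to stamp every log record with the active trace ID. A stdlib sketch using a `logging.Filter`: in a real system the ID would come from the tracing library (e.g. OpenTelemetry's `trace.get_current_span()`); here a `contextvars.ContextVar` stands in for it.

```python
import contextvars
import logging
import sys

# Stand-in for the tracing library's notion of the current trace.
current_trace_id = contextvars.ContextVar("trace_id", default="unknown")

class TraceIdFilter(logging.Filter):
    """Stamp every log record with the trace ID of the current request."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, just enrich it

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter(
    '{"level": "%(levelname)s", "trace_id": "%(trace_id)s", "message": "%(message)s"}'
))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

current_trace_id.set("abc123def456")
logger.error("Connection pool exhausted, waiting for available connection")
```

With the trace ID on every log line, jumping from a slow span in the trace viewer to the exact logs it produced becomes a single indexed query.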
Building Your Observability Stack
| Component | Open Source | Commercial |
|---|---|---|
| Metrics | Prometheus + Grafana | Datadog, New Relic |
| Logs | Grafana Loki, ELK Stack | Splunk, Datadog |
| Traces | Jaeger, Zipkin | Datadog, Honeycomb |
| All-in-one | Grafana Stack | Datadog, Dynatrace |
Conclusion
Observability is not a product you buy; it is a property of your system that you build. Start by instrumenting your most critical services with all three pillars, and gradually expand coverage.