
Observability Concepts

Introduction

Observability is a crucial aspect of modern system management that goes beyond traditional monitoring. While monitoring tells you when something is wrong, observability helps you understand why it's happening. In this guide, we'll explore the core concepts of observability, how they differ from traditional monitoring, and how Grafana helps implement observability in your systems.

What is Observability?

Observability originated from control theory and refers to how well you can understand a system's internal states based on its external outputs. In software and infrastructure contexts, observability means having enough data about your systems that you can:

  1. Track the current health of your systems
  2. Troubleshoot when things go wrong
  3. Understand performance and behavior patterns
  4. Make data-driven decisions for improvements

A highly observable system provides insights without requiring additional code deployments to investigate issues.

The Three Pillars of Observability

Observability is commonly built on three fundamental data types:

1. Metrics

Metrics are numerical measurements collected at regular intervals. They are typically time-series data that represent the state or performance of your system.

Characteristics of metrics:

  • Highly structured and aggregatable
  • Low storage requirements
  • Fast querying for dashboards and alerts
  • Good for known patterns and trends

Example metrics in Grafana:

# CPU usage metric
cpu_usage{instance="server-01", job="node"} 0.65

# Memory utilization
memory_used_bytes{instance="server-01", job="node"} 4096000000

# HTTP request count
http_requests_total{method="GET", endpoint="/api/users", status="200"} 1234

2. Logs

Logs are text-based records of events that occurred within your application or system. They provide a detailed narrative of what happened and when.

Characteristics of logs:

  • Semi-structured or unstructured text data
  • Higher storage requirements than metrics
  • Valuable for detailed context and debugging
  • Best for investigating known issues

Example log entry:

2023-08-15T14:22:31.142Z INFO [UserService] User login successful user_id=1234 source_ip=192.168.1.105 session_id=abcd1234

3. Traces

Traces track the journey of a request as it moves through a distributed system, helping you understand the end-to-end flow and identify bottlenecks.

Characteristics of traces:

  • Show causal relationships between services
  • Critical for microservice architectures
  • Help identify performance bottlenecks
  • Provide context for complex system interactions

In Grafana, a trace is typically visualized as a waterfall of nested spans. The simplified sketch below (service names and timings are illustrative) shows how the spans of a single request relate to one another:
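Trace 3f2a1b: GET /api/orders (182 ms total)
├── api-gateway: handle request            182 ms
│   ├── order-service: GET /orders         151 ms
│   │   ├── postgres: SELECT orders         48 ms
│   │   └── payment-service: GET /status    72 ms
│   └── auth-service: validate token         9 ms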

Observability vs. Monitoring

While related, observability and monitoring serve different purposes:

Monitoring                                  | Observability
------------------------------------------- | -------------------------------------
Tells you when something is broken          | Helps you understand why it's broken
Based on predefined metrics and thresholds  | Allows for exploring unknown issues
Reactive approach                           | Proactive approach
Focuses on known failure modes              | Helps discover unknown failure modes
Dashboard-centric                           | Query and exploration-centric

Implementing Observability with Grafana

Grafana serves as an observability platform by providing:

1. Data Source Integration

Grafana connects to various data sources for each pillar:

  • Metrics: Prometheus, InfluxDB, Graphite
  • Logs: Loki, Elasticsearch
  • Traces: Tempo, Jaeger, Zipkin
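
These connections can be configured in the UI or provisioned as code. The sketch below shows a minimal provisioning file for one data source per pillar; the hostnames, ports, and file path are assumptions for a typical containerized setup, not requirements:

yaml
# provisioning/datasources/datasources.yaml (illustrative hostnames and ports)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200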

2. Unified Dashboard Experience

Grafana allows you to visualize all three pillars on a single dashboard, creating a unified observability experience:

javascript
// Example Grafana panel configuration for a metric
{
  "datasource": "Prometheus",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service)",
      "legendFormat": "{{service}}"
    }
  ],
  "title": "HTTP 5xx Error Rate by Service"
}

3. Correlation Between Data Types

One of Grafana's key strengths is correlating different data types. For example, you can spot a spike in an error metric, explore the logs from that time window, and then examine traces of the failed requests, all within one debugging workflow.
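
One concrete way to wire up the logs-to-traces hop is the Loki data source's derived fields, which extract a trace ID from a log line and turn it into a link to the corresponding trace. A minimal provisioning sketch, where the regex, field name, and data source UID are assumptions for illustration:

yaml
# Additional Loki data source settings (illustrative values)
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Extract trace_id=<id> from log lines and link it to the Tempo data source
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'
          url: '$${__value.raw}'   # $$ escapes $ in provisioning files
          datasourceUid: tempo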

Implementing Basic Observability

Let's walk through setting up basic observability for a web application:

Step 1: Instrument Your Application for Metrics

Use a client library appropriate for your language to capture metrics:

javascript
// Node.js example with Prometheus client
const promClient = require('prom-client');

// Create a counter for HTTP requests
const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'status', 'path']
});

// In your request handler
app.use((req, res, next) => {
  res.on('finish', () => {
    httpRequestsTotal.inc({
      method: req.method,
      status: res.statusCode,
      path: req.route?.path || 'unknown'
    });
  });
  next();
});
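
For Prometheus to scrape these metrics, the application also needs to expose them over HTTP. A minimal sketch that builds on the snippet above, using prom-client's default registry (the /metrics path is the usual convention, not a requirement):

javascript
// Expose all collected metrics for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});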

Step 2: Set Up Structured Logging

Implement structured logging to make your logs more queryable:

javascript
// Node.js example with Winston logger
const winston = require('winston');

const logger = winston.createLogger({
  format: winston.format.json(),
  defaultMeta: { service: 'user-service' },
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'combined.log' })
  ]
});

// Usage
logger.info('User login attempt', {
  userId: '1234',
  sourceIp: req.ip,
  success: true,
  latencyMs: 45
});
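
To make these JSON logs queryable in Grafana, an agent such as Promtail typically tails the log file and pushes entries to Loki. A minimal config sketch, where the file paths, port, and Loki URL are assumptions:

yaml
# promtail-config.yaml (illustrative paths and URLs)
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: user-service
    static_configs:
      - targets:
          - localhost
        labels:
          job: user-service
          __path__: /var/log/user-service/combined.log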

Step 3: Implement Distributed Tracing

Add tracing to track requests across services:

javascript
// Node.js example with OpenTelemetry
// (package names below follow an older OpenTelemetry JS release;
//  newer releases ship the same pieces under @opentelemetry/sdk-trace-node)
const { NodeTracerProvider } = require('@opentelemetry/node');
const { SimpleSpanProcessor } = require('@opentelemetry/tracing');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { SpanStatusCode } = require('@opentelemetry/api');

// Set up the tracer
const provider = new NodeTracerProvider();
const exporter = new JaegerExporter({
  serviceName: 'my-service',
});
provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
provider.register();

// Get a tracer
const tracer = provider.getTracer('my-service-tracer');

// Create spans for operations
async function handleRequest(req, res) {
  const span = tracer.startSpan('handleRequest');
  try {
    // Set attributes on the span
    span.setAttribute('http.method', req.method);
    span.setAttribute('http.url', req.url);

    // Create a child span for the database operation
    const dbSpan = tracer.startSpan('database.query', {
      parent: span
    });
    const result = await database.query('SELECT * FROM users');
    dbSpan.end();

    res.json(result);
  } catch (error) {
    // Record errors on the span
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message
    });
    res.status(500).send('Internal error');
  } finally {
    span.end();
  }
}

Step 4: Configure Grafana Dashboards

Create dashboards that visualize your service health:

  1. Service Overview Dashboard: Key metrics like request rate, error rate, and latency
  2. Log Explorer Dashboard: Filter and search logs
  3. Trace Analysis Dashboard: View and analyze distributed traces
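
For the Service Overview Dashboard, panels are typically driven by a handful of PromQL queries. The sketch below assumes the counter from Step 1 plus a request-duration histogram (http_request_duration_seconds, not instrumented in Step 1):

# Request rate
sum(rate(http_requests_total[5m]))

# Error rate (share of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# 95th percentile latency (assumes a request-duration histogram is also exported)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))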

Observability Best Practices

Follow these guidelines for effective observability:

1. Instrument at the Right Level

  • Avoid excessive instrumentation that creates noise
  • Focus on service boundaries and critical paths
  • Instrument both infrastructure and applications

2. Use Consistent Naming and Labels

# Good: Consistent naming and labeling
http_requests_total{service="auth", endpoint="/login", method="POST", status="200"}

# Bad: Inconsistent naming and labels
http_reqs{srv="auth", url="/login", type="POST", code="200"}

3. Implement Context Propagation

Ensure that context (like trace IDs) flows between services:

javascript
// Example of propagating trace context in HTTP headers with OpenTelemetry
const { context, trace, propagation } = require('@opentelemetry/api');
const axios = require('axios');

function makeDownstreamRequest(parentSpan) {
  const headers = {};
  // Make parentSpan the active span, then inject its context (the traceparent header) into the carrier
  const ctx = trace.setSpan(context.active(), parentSpan);
  propagation.inject(ctx, headers);

  return axios.get('https://api.example.com/data', { headers });
}

4. Set Up Alerts on SLIs/SLOs

Base alerts on Service Level Indicators (SLIs) and Objectives (SLOs) rather than raw metrics:

yaml
# Example Prometheus alert based on SLO
groups:
  - name: SLO Alerts
    rules:
      - alert: APIErrorBudgetBurning
        expr: sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h])) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API error budget burning too fast"
          description: "Error rate of {{ $value | humanizePercentage }} exceeds 1% threshold"

Advanced Observability Concepts

Exemplars

Exemplars link metrics to traces, allowing you to jump from a metric spike directly to trace samples that contributed to that spike.

# Metric with exemplar
http_request_duration_seconds_bucket{le="0.1",status="200"} 12345 # {trace_id="abcdef123456"}

High Cardinality Data

Be mindful of high-cardinality data (metrics with many unique label combinations), as they can impact performance:

# High cardinality (problematic)
http_requests_total{user_id="12345", session_id="abcdef", ...}

# Better approach
http_requests_total{service="auth", endpoint="/login"}

# Track high cardinality data in logs or traces instead

RED Method

A pattern for monitoring microservices:

  • Rate: Requests per second
  • Error rate: Failed requests per second
  • Duration: Distribution of response times

USE Method

A pattern for monitoring resources:

  • Utilization: Percentage of time the resource is busy
  • Saturation: How much extra work is queued that the resource cannot yet service
  • Errors: Count of error events
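
With node_exporter metrics in Prometheus, the USE method maps to queries roughly like the sketch below (metric names are node_exporter's; the windows are illustrative):

# Utilization: share of CPU time spent non-idle, per instance
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation: 1-minute load average relative to the number of CPUs
node_load1 / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"})

# Errors: network interface receive errors
rate(node_network_receive_errs_total[5m])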

Summary

Observability extends traditional monitoring by providing deeper insights into system behavior. Through the three pillars—metrics, logs, and traces—you gain a comprehensive understanding of your systems that enables faster troubleshooting and more informed decision-making.

Grafana serves as a powerful platform for implementing observability by integrating these pillars into a unified experience. By following the practices outlined in this guide, you'll be well on your way to building more observable and reliable systems.

Next Steps

  • Practice Exercise: Instrument a simple application with metrics, logs, and traces
  • Challenge: Create a Grafana dashboard that correlates all three pillars for a specific service
  • Explore: Experiment with different data sources in Grafana to understand their strengths and limitations
