Tracing Integration
Introduction
Distributed tracing is a powerful observability technique that helps you understand how requests flow through your applications, especially in microservice architectures. In this guide, we'll explore how Grafana integrates with tracing systems to provide end-to-end visibility into your application performance.
Tracing complements metrics and logs by showing:
- The path of a request across services
- Performance bottlenecks
- Dependencies between components
- Exactly where errors occur
By the end of this tutorial, you'll understand how to set up, visualize, and analyze traces in Grafana to troubleshoot complex issues in your applications.
Understanding Distributed Tracing
Before diving into Grafana's tracing capabilities, let's establish some core concepts:
Key Tracing Concepts
- Trace: A record of a request's journey through your system
- Span: A single operation within a trace, representing work done in a service
- Span Context: Metadata that identifies a span and its position in the trace
- Parent/Child Relationship: Shows how spans relate to each other hierarchically
- Service Graph: Visualization of service dependencies derived from trace data
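To make the span and parent/child concepts concrete, here is a minimal sketch using the OpenTelemetry JavaScript API (instrumentation is covered in more detail later in this guide); the service and operation names are hypothetical:
// A parent span with one nested child span; together they form (part of) a trace
const opentelemetry = require('@opentelemetry/api');
const tracer = opentelemetry.trace.getTracer('example-service');
tracer.startActiveSpan('checkout', (parent) => {
  // Any span started while 'checkout' is active becomes its child
  tracer.startActiveSpan('validate-cart', (child) => {
    child.end();
  });
  parent.end();
});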
Grafana Tracing Data Sources
Grafana supports multiple tracing backends:
- Tempo: Grafana's native tracing backend
- Jaeger: A popular open-source tracing system
- Zipkin: Another widely-used tracing solution
- OpenTelemetry: the vendor-neutral standard for generating and transmitting observability data; OTLP-format traces can be ingested by backends such as Tempo
Let's set up a Tempo data source as an example.
Setting Up Tempo in Grafana
- Navigate to Configuration → Data Sources in Grafana
- Click Add data source
- Select Tempo
- Configure the connection:
URL: http://tempo:3200 (Tempo's default HTTP port; adjust to match your deployment)
For basic setup, you can use the default values for the remaining fields.
Exploring Traces in Grafana
Once your data source is configured, you can start exploring traces.
Trace Explorer
The Trace Explorer is the main interface for searching and analyzing traces:
- Go to Explore in the left sidebar
- Select your tracing data source from the dropdown
- Use the search interface to find traces:
- By trace ID
- By service name
- By duration
- By tags and attributes
Analyzing a Trace
When you view a trace, you'll see a waterfall visualization showing:
- All spans in chronological order
- Duration of each span
- Parent-child relationships
- Service boundaries
- Error indicators
Understanding Trace Details
When you select a span within a trace, you can see detailed information:
- Operation: The specific action being performed
- Duration: How long the span took to execute
- Tags/Attributes: Key-value pairs with additional context
- Logs: Events that occurred during the span
- Process Information: Details about the service that generated the span
Integrating Traces with Logs and Metrics
Grafana's power comes from connecting different types of observability data.
Trace to Logs
You can configure Grafana to jump from a trace span directly to relevant logs:
{
  "datasourceUid": "loki",
  "tags": [
    { "key": "service.name", "value": "service" },
    { "key": "trace_id", "value": "traceID" }
  ],
  "spanStartTimeShift": "-1h",
  "spanEndTimeShift": "1h"
}
This configuration tells Grafana which Loki data source to query and how to map span attributes to log labels. The time shifts widen the search window around the span (here, from one hour before it starts to one hour after it ends), which helps when log timestamps don't line up exactly with span timestamps.
Trace to Metrics
Similarly, you can link traces to relevant metrics:
{
  "datasourceUid": "prometheus",
  "queries": [
    {
      "name": "Request Rate",
      "query": "rate(http_requests_total{service=\"$service\"}[$__interval])"
    }
  ],
  "tags": [
    { "key": "service.name", "value": "service" }
  ]
}
Practical Example: Troubleshooting High Latency
Let's walk through a real-world example of using traces to find performance issues.
Scenario
Users are reporting slow checkout in your e-commerce application.
Step 1: Identify Problematic Traces
- Go to Explore and select your tracing data source
- Search for traces with the checkout operation and a duration greater than 1s
- Examine the results to find traces with unusually long durations
Step 2: Analyze the Trace
Looking at a slow trace, you might see:
// Trace visualization shows:
// checkout (1.2s)
// ├─ validateCart (50ms)
// ├─ processPayment (950ms) <-- Suspiciously slow!
// │ └─ paymentGateway.authorize (920ms)
// └─ createOrder (200ms)
Step 3: Drill Down into the Slow Span
Click on the processPayment span to see details:
- The span took 950ms, much longer than normal
- The tags show it's calling an external payment provider
- Network attributes show high latency
Step 4: Correlate with Metrics and Logs
- Click on "View Logs" to see relevant logs during this time
- Check metrics for the payment service to see if this is a pattern
Step 5: Identify and Fix the Root Cause
In this case, you might discover:
- The payment provider API is experiencing slowdowns
- Your connection pool settings are too restrictive
- The solution might be to implement a retry mechanism or circuit breaker (a retry sketch follows below)
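As an illustration of that last point, here is a minimal retry-with-timeout sketch around the slow call; paymentGateway.authorize is the hypothetical gateway call from the trace above, and the attempt count and timeout are arbitrary example values:
// Retry the gateway call a few times, abandoning each attempt after a timeout
async function authorizeWithRetry(payment, { attempts = 3, timeoutMs = 2000 } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      // Fail the attempt if the gateway takes longer than timeoutMs
      return await Promise.race([
        paymentGateway.authorize(payment),
        new Promise((_, reject) =>
          setTimeout(() => reject(new Error('payment gateway timeout')), timeoutMs)
        ),
      ]);
    } catch (error) {
      if (attempt === attempts) throw error; // out of retries: surface the error
    }
  }
}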
Setting Up Application Instrumentation
To get traces from your applications, they need to be instrumented.
Using OpenTelemetry (Modern Approach)
OpenTelemetry provides a vendor-neutral way to instrument applications:
// Example Node.js instrumentation with OpenTelemetry
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
// Configure an exporter that sends spans to Tempo's OTLP HTTP endpoint
const exporter = new OTLPTraceExporter({
  url: 'http://tempo:4318/v1/traces',
});
// Create the tracer provider and attach a batching span processor that uses the exporter
const provider = new NodeTracerProvider();
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
// Register the provider as the global tracer provider
provider.register();
// Set up auto-instrumentation for incoming and outgoing HTTP calls
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
  ],
});
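Assuming this setup lives in a file such as tracing.js (a hypothetical name), load it before the rest of your application so the auto-instrumentation can patch modules as they are imported, for example with node --require ./tracing.js app.js.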
Manual Instrumentation Example
For fine-grained control, you can add custom spans:
// Getting a tracer (the tracer comes from the OpenTelemetry API package)
const opentelemetry = require('@opentelemetry/api');
const tracer = opentelemetry.trace.getTracer('checkout-service');
// Creating a custom span
async function processOrder(orderId) {
  const span = tracer.startSpan('process-order');
  // Add context to the span
  span.setAttribute('order.id', orderId);
  try {
    // Your business logic here
    const result = await validateOrder(orderId);
    span.setAttribute('order.status', result.status);
    return result;
  } catch (error) {
    // Record errors
    span.recordException(error);
    span.setStatus({
      code: opentelemetry.SpanStatusCode.ERROR,
      message: error.message
    });
    throw error;
  } finally {
    // Always end the span
    span.end();
  }
}
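If you also want the new span to become the active span for code called inside it, so that any nested spans are parented automatically, the API's startActiveSpan helper is often more convenient. A minimal sketch, reusing the hypothetical validateOrder from above:
// startActiveSpan manages context, so spans created inside nest under this one
async function processOrderActive(orderId) {
  return tracer.startActiveSpan('process-order', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      return await validateOrder(orderId);
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: opentelemetry.SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}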
Advanced Tracing Features in Grafana
Service Graphs
Grafana can generate service dependency maps from your trace data:
- Go to Explore and select your tracing data source
- Click on the Service Graph tab
- View an interactive map showing services and their dependencies
- Hover over connections to see request rates and error percentages
Trace Analytics
Tempo and Grafana provide analytical views of trace data:
- Latency Histograms: Visualize the distribution of request durations
- Error Rates: Monitor failure percentages across services
- Span Statistics: Analyze performance patterns by operation
An illustrative query configuration for visualizing trace latency (the exact query format depends on your Tempo and Grafana versions):
{
  "queryType": "spans",
  "filters": [
    { "tag": "service.name", "operator": "=", "value": "checkout-service" },
    { "tag": "operation", "operator": "=", "value": "processOrder" }
  ],
  "groupBy": { "tag": "status.code" },
  "calculations": ["count", "p50", "p95", "p99"]
}
Best Practices for Tracing in Grafana
- Sample Wisely: In high-volume systems, trace only a percentage of requests (a sampler sketch follows this list)
- Focus on Critical Paths: Ensure key user journeys are well-instrumented
- Add Business Context: Include business identifiers (order IDs, user IDs) as span attributes
- Use Consistent Naming: Adopt a naming convention for services and operations
- Link All Observability Data: Connect traces with logs and metrics
- Set Appropriate Retention: Configure retention based on your troubleshooting needs
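To illustrate the sampling point above, here is a minimal head-sampling sketch with the OpenTelemetry Node SDK, building on the NodeTracerProvider setup shown earlier; the 10% ratio is an arbitrary example value:
// Keep roughly 10% of traces; downstream services follow the parent's sampling decision
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');
const provider = new NodeTracerProvider({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1), // sample ~10% of new (root) traces
  }),
});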
Summary
In this guide, we've explored how to integrate tracing with Grafana to gain deeper insights into your applications:
- We learned about distributed tracing concepts
- We set up a tracing data source in Grafana
- We explored how to analyze traces to find performance issues
- We connected traces with logs and metrics
- We saw how to instrument applications to generate trace data
- We explored advanced features like service graphs and analytics
Tracing integration completes the observability triad alongside metrics and logs, giving you a comprehensive view of your system's behavior and performance.
Additional Resources
- Grafana Tempo Documentation
- OpenTelemetry Documentation
- Distributed Tracing: A Complete Guide
- Grafana Labs Webinar: Tracing Made Simple with Grafana
Exercises
- Set up a Tempo or Jaeger data source in your Grafana instance.
- Instrument a simple application using OpenTelemetry and send traces to Grafana.
- Create a dashboard that shows both metrics and trace data for a service.
- Use the Trace Explorer to find the slowest operations in your application.
- Configure trace-to-logs correlation to jump from spans to relevant log entries.