Tracing Integration
Introduction
Distributed tracing is a powerful observability technique that helps you understand how requests flow through your applications, especially in microservice architectures. In this guide, we'll explore how Grafana integrates with tracing systems to provide end-to-end visibility into your application performance.
Tracing complements metrics and logs by showing:
- The path of a request across services
- Performance bottlenecks
- Dependencies between components
- Exactly where errors occur
By the end of this tutorial, you'll understand how to set up, visualize, and analyze traces in Grafana to troubleshoot complex issues in your applications.
Understanding Distributed Tracing
Before diving into Grafana's tracing capabilities, let's establish some core concepts:
Key Tracing Concepts
- Trace: A record of a request's journey through your system
- Span: A single operation within a trace, representing work done in a service
- Span Context: Metadata that identifies a span and its position in the trace
- Parent/Child Relationship: Shows how spans relate to each other hierarchically
- Service Graph: Visualization of service dependencies derived from trace data
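To make the span and parent/child concepts concrete, here is a minimal sketch using the OpenTelemetry JavaScript API (instrumentation is covered in more detail later in this guide); the service and operation names are hypothetical:
// A parent span with one nested child span; together they form (part of) a trace
const opentelemetry = require('@opentelemetry/api');
const tracer = opentelemetry.trace.getTracer('example-service');
tracer.startActiveSpan('checkout', (parent) => {
  // Any span started while 'checkout' is active becomes its child
  tracer.startActiveSpan('validate-cart', (child) => {
    child.end();
  });
  parent.end();
});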
Grafana Tracing Data Sources
Grafana supports multiple tracing backends:
- Tempo: Grafana's native tracing backend
- Jaeger: A popular open-source tracing system
- Zipkin: Another widely-used tracing solution
- OpenTelemetry: the vendor-neutral standard for generating and transmitting observability data; OTLP-format traces can be ingested by backends such as Tempo
Let's set up a Tempo data source as an example.
Setting Up Tempo in Grafana
- Navigate to Configuration → Data Sources in Grafana
- Click Add data source
- Select Tempo
- Configure the connection:
URL: http://tempo:3200 (Tempo's default HTTP port; adjust to match your deployment)
For basic setup, you can use the default values for the remaining fields.
Exploring Traces in Grafana
Once your data source is configured, you can start exploring traces.
Trace Explorer
The Trace Explorer is the main interface for searching and analyzing traces:
- Go to Explore in the left sidebar
- Select your tracing data source from the dropdown
- Use the search interface to find traces:
- By trace ID
- By service name
- By duration
- By tags and attributes
Analyzing a Trace
When you view a trace, you'll see a waterfall visualization showing:
- All spans in chronological order
- Duration of each span
- Parent-child relationships
- Service boundaries
- Error indicators
Understanding Trace Details
When you select a span within a trace, you can see detailed information:
- Operation: The specific action being performed
- Duration: How long the span took to execute
- Tags/Attributes: Key-value pairs with additional context
- Logs: Events that occurred during the span
- Process Information: Details about the service that generated the span
Integrating Traces with Logs and Metrics
Grafana's power comes from connecting different types of observability data.
Trace to Logs
You can configure Grafana to jump from a trace span directly to relevant logs:
{
  "datasourceUid": "loki",
  "tags": [
    { "key": "service.name", "value": "service" },
    { "key": "trace_id", "value": "traceID" }
  ],
  "spanStartTimeShift": "-1h",
  "spanEndTimeShift": "1h"
}
This configuration tells Grafana which Loki data source to query and how to map span attributes to log labels. The time shifts widen the search window around the span (here, from one hour before it starts to one hour after it ends), which helps when log timestamps don't line up exactly with span timestamps.
Trace to Metrics
Similarly, you can link traces to relevant metrics:
{
  "datasourceUid": "prometheus",
  "queries": [
    {
      "name": "Request Rate",
      "query": "rate(http_requests_total{service=\"$service\"}[$__interval])"
    }
  ],
  "tags": [
    { "key": "service.name", "value": "service" }
  ]
}
Practical Example: Troubleshooting High Latency
Let's walk through a real-world example of using traces to find performance issues.
Scenario
Users are reporting slow checkout in your e-commerce application.
Step 1: Identify Problematic Traces
- Go to Explore and select your tracing data source
- Search for traces with the checkout operation and a duration greater than 1s
- Examine the results to find traces with unusually long durations
Step 2: Analyze the Trace
Looking at a slow trace, you might see:
// Trace visualization shows:
// checkout (1.2s)
// ├─ validateCart (50ms)
// ├─ processPayment (950ms) <-- Suspiciously slow!
// │ └─ paymentGateway.authorize (920ms)
// └─ createOrder (200ms)
Step 3: Drill Down into the Slow Span
Click on the processPayment span to see details:
- The span took 950ms, much longer than normal
- The tags show it's calling an external payment provider
- Network attributes show high latency
Step 4: Correlate with Metrics and Logs
- Click on "View Logs" to see relevant logs during this time
- Check metrics for the payment service to see if this is a pattern
Step 5: Identify and Fix the Root Cause
In this case, you might discover:
- The payment provider API is experiencing slowdowns
- Your connection pool settings are too restrictive
- The solution might be to implement a retry mechanism or circuit breaker (a retry sketch follows below)
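As an illustration of that last point, here is a minimal retry-with-timeout sketch around the slow call; paymentGateway.authorize is the hypothetical gateway call from the trace above, and the attempt count and timeout are arbitrary example values:
// Retry the gateway call a few times, abandoning each attempt after a timeout
async function authorizeWithRetry(payment, { attempts = 3, timeoutMs = 2000 } = {}) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      // Fail the attempt if the gateway takes longer than timeoutMs
      return await Promise.race([
        paymentGateway.authorize(payment),
        new Promise((_, reject) =>
          setTimeout(() => reject(new Error('payment gateway timeout')), timeoutMs)
        ),
      ]);
    } catch (error) {
      if (attempt === attempts) throw error; // out of retries: surface the error
    }
  }
}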
Setting Up Application Instrumentation
To get traces from your applications, they need to be instrumented.
Using OpenTelemetry (Modern Approach)
OpenTelemetry provides a vendor-neutral way to instrument applications:
// Example Node.js instrumentation with OpenTelemetry
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
// Configure an exporter that sends spans to Tempo's OTLP HTTP endpoint
const exporter = new OTLPTraceExporter({
  url: 'http://tempo:4318/v1/traces',
});
// Create the tracer provider and attach a batching span processor that uses the exporter
const provider = new NodeTracerProvider();
provider.addSpanProcessor(new BatchSpanProcessor(exporter));
// Register the provider as the global tracer provider
provider.register();
// Set up auto-instrumentation for incoming and outgoing HTTP calls
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
  ],
});
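Assuming this setup lives in a file such as tracing.js (a hypothetical name), load it before the rest of your application so the auto-instrumentation can patch modules as they are imported, for example with node --require ./tracing.js app.js.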
Manual Instrumentation Example
For fine-grained control, you can add custom spans:
// Getting a tracer (the tracer comes from the OpenTelemetry API package)
const opentelemetry = require('@opentelemetry/api');
const tracer = opentelemetry.trace.getTracer('checkout-service');
// Creating a custom span
async function processOrder(orderId) {
  const span = tracer.startSpan('process-order');
  // Add context to the span
  span.setAttribute('order.id', orderId);
  try {
    // Your business logic here
    const result = await validateOrder(orderId);
    span.setAttribute('order.status', result.status);
    return result;
  } catch (error) {
    // Record errors
    span.recordException(error);
    span.setStatus({
      code: opentelemetry.SpanStatusCode.ERROR,
      message: error.message
    });
    throw error;
  } finally {
    // Always end the span
    span.end();
  }
}
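If you also want the new span to become the active span for code called inside it, so that any nested spans are parented automatically, the API's startActiveSpan helper is often more convenient. A minimal sketch, reusing the hypothetical validateOrder from above:
// startActiveSpan manages context, so spans created inside nest under this one
async function processOrderActive(orderId) {
  return tracer.startActiveSpan('process-order', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      return await validateOrder(orderId);
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: opentelemetry.SpanStatusCode.ERROR, message: error.message });
      throw error;
    } finally {
      span.end();
    }
  });
}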
Advanced Tracing Features in Grafana
Service Graphs
Grafana can generate service dependency maps from your trace data:
- Go to Explore and select your tracing data source
- Click on the Service Graph tab
- View an interactive map showing services and their dependencies
- Hover over connections to see request rates and error percentages
Trace Analytics
Tempo and Grafana provide analytical views of trace data:
- Latency Histograms: Visualize the distribution of request durations
- Error Rates: Monitor failure percentages across services
- Span Statistics: Analyze performance patterns by operation
An illustrative query configuration for visualizing trace latency (the exact query format depends on your Tempo and Grafana versions):
{
  "queryType": "spans",
  "filters": [
    { "tag": "service.name", "operator": "=", "value": "checkout-service" },
    { "tag": "operation", "operator": "=", "value": "processOrder" }
  ],
  "groupBy": { "tag": "status.code" },
  "calculations": ["count", "p50", "p95", "p99"]
}
Best Practices for Tracing in Grafana
- Sample Wisely: In high-volume systems, trace only a percentage of requests (a sampler sketch follows this list)
- Focus on Critical Paths: Ensure key user journeys are well-instrumented
- Add Business Context: Include business identifiers (order IDs, user IDs) as span attributes
- Use Consistent Naming: Adopt a naming convention for services and operations
- Link All Observability Data: Connect traces with logs and metrics
- Set Appropriate Retention: Configure retention based on your troubleshooting needs
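To illustrate the sampling point above, here is a minimal head-sampling sketch with the OpenTelemetry Node SDK, building on the NodeTracerProvider setup shown earlier; the 10% ratio is an arbitrary example value:
// Keep roughly 10% of traces; downstream services follow the parent's sampling decision
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');
const provider = new NodeTracerProvider({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1), // sample ~10% of new (root) traces
  }),
});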
Summary
In this guide, we've explored how to integrate tracing with Grafana to gain deeper insights into your applications:
- We learned about distributed tracing concepts
- We set up a tracing data source in Grafana
- We explored how to analyze traces to find performance issues
- We connected traces with logs and metrics
- We saw how to instrument applications to generate trace data
- We explored advanced features like service graphs and analytics
Tracing integration completes the observability triad alongside metrics and logs, giving you a comprehensive view of your system's behavior and performance.
Additional Resources
- Grafana Tempo Documentation
- OpenTelemetry Documentation
- Distributed Tracing: A Complete Guide
- Grafana Labs Webinar: Tracing Made Simple with Grafana
Exercises
- Set up a Tempo or Jaeger data source in your Grafana instance.
- Instrument a simple application using OpenTelemetry and send traces to Grafana.
- Create a dashboard that shows both metrics and trace data for a service.
- Use the Trace Explorer to find the slowest operations in your application.
- Configure trace-to-logs correlation to jump from spans to relevant log entries.