
Exporter Best Practices

Introduction

Prometheus exporters are specialized applications that collect metrics from existing systems and expose them in a format that Prometheus can scrape. While creating an exporter might seem straightforward, following established best practices ensures your exporters are reliable and maintainable and that they expose genuinely useful metrics. This guide covers essential best practices for developing and deploying Prometheus exporters, helping you avoid common pitfalls and optimize your monitoring infrastructure.

Naming Conventions

Metric Naming

Consistent metric naming is crucial for clarity and usability:

  • Use a format of namespace_subsystem_name (e.g., http_requests_total)
  • Choose a relevant namespace that represents the monitored system (e.g., node, mysql, http)
  • Use snake_case for metric names
  • Include units in the metric name (e.g., http_request_duration_seconds)
  • Use suffixes like _total, _count, _sum, _bucket for appropriate metric types
go
// Good example
prometheus.NewCounter(prometheus.CounterOpts{
    Name: "http_requests_total",
    Help: "Total number of HTTP requests processed",
})

// Poor example - inconsistent naming
prometheus.NewCounter(prometheus.CounterOpts{
    Name: "HTTPReq-Count",
    Help: "Requests",
})

Label Naming

Labels allow you to add dimensions to your metrics, but should follow these conventions:

  • Use snake_case for label names
  • Keep label names short but descriptive
  • Avoid high-cardinality labels (those with unlimited or very high numbers of potential values)
  • Common label names: environment, instance, job, service, endpoint
go
// Good example
prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total number of HTTP requests processed",
    },
    []string{"method", "endpoint", "status_code"},
)

// Poor example - high cardinality
prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total number of HTTP requests processed",
    },
    []string{"user_id", "precise_timestamp", "session_id"},
)

Metric Types

Choose the appropriate metric type based on what you're measuring:

Counter

  • Use for metrics that only increase over time (e.g., total requests, errors)
  • Never decrease a counter (except during restart)
  • Name them with suffixes like _total
go
requestsTotal := prometheus.NewCounter(prometheus.CounterOpts{
    Name: "http_requests_total",
    Help: "Total number of HTTP requests processed",
})

// Increment counter when handling a request
requestsTotal.Inc()

Gauge

  • Use for metrics that can go up and down (e.g., current memory usage, connection pool size)
  • Represent a snapshot of current state
go
activeConnections := prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "http_active_connections",
    Help: "Current number of active HTTP connections",
})

// Set or adjust gauge based on current state
activeConnections.Set(float64(connectionPool.Size()))
activeConnections.Inc() // Increment by 1
activeConnections.Dec() // Decrement by 1

Histogram

  • Use for measuring distributions of values (e.g., request duration, response size)
  • Creates multiple time series: count, sum, and buckets
  • Define appropriate buckets for your use case (see the custom-bucket sketch after the example below)
go
requestDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "HTTP request duration distribution",
    Buckets: prometheus.DefBuckets, // Default: [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
})

// Record the duration of a request
func handleRequest() {
    start := time.Now()
    // ... handle request ...
    duration := time.Since(start).Seconds()
    requestDuration.Observe(duration)
}
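
If the defaults don't match your latency profile, define your own bucket boundaries. The sketch below is illustrative; the boundaries are assumptions for a service whose requests mostly complete between 10ms and 2s, not a recommendation:

go
// Buckets tailored to an assumed 10ms-2s latency range (adjust to your workload)
requestDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "HTTP request duration distribution",
    Buckets: []float64{0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2},
})

// Helpers for generating boundaries programmatically:
// prometheus.LinearBuckets(0.1, 0.1, 10)    // 0.1, 0.2, ..., 1.0
// prometheus.ExponentialBuckets(0.01, 2, 8) // 0.01, 0.02, 0.04, ..., 1.28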

Summary

  • Similar to histograms, but calculates quantiles directly on the client side
  • Use when precise quantiles for a single instance are needed; summary quantiles cannot be meaningfully aggregated across instances
  • More expensive for the client to compute than histograms
go
requestDuration := prometheus.NewSummary(prometheus.SummaryOpts{
    Name:       "http_request_duration_seconds",
    Help:       "HTTP request duration distribution",
    Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
})

// Record the duration of a request
func handleRequest() {
    start := time.Now()
    // ... handle request ...
    duration := time.Since(start).Seconds()
    requestDuration.Observe(duration)
}

Performance Considerations

Resource Usage

Exporters should be lightweight and efficient:

  • Minimize CPU and memory usage
  • Use caching when appropriate
  • Consider the frequency of metrics collection
go
// Inefficient - collects metrics on every scrape
http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
    collectExpensiveMetrics()
    // ...
})

// Better - uses caching
var (
    metricsMutex     sync.RWMutex
    lastCollectTime  time.Time
    cachedMetrics    []byte
    cacheValidPeriod = 30 * time.Second
)

http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
    metricsMutex.RLock()
    if time.Since(lastCollectTime) < cacheValidPeriod {
        w.Write(cachedMetrics)
        metricsMutex.RUnlock()
        return
    }
    metricsMutex.RUnlock()

    metricsMutex.Lock()
    defer metricsMutex.Unlock()

    // Double-check after acquiring write lock
    if time.Since(lastCollectTime) < cacheValidPeriod {
        w.Write(cachedMetrics)
        return
    }

    // Collect and cache metrics
    metrics := collectExpensiveMetrics()
    cachedMetrics = metrics
    lastCollectTime = time.Now()

    w.Write(metrics)
})

Cardinality

High cardinality can cause performance issues in Prometheus:

  • Limit the number of label combinations
  • Avoid using unique identifiers as label values
  • Consider aggregating high-cardinality metrics at the exporter level (a sketch follows the example below)
go
// Bad - high cardinality
prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "user_login_total",
        Help: "User login attempts",
    },
    []string{"username", "ip_address", "session_id"}, // Can generate millions of time series!
)

// Better - lower cardinality
prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "user_login_total",
        Help: "User login attempts",
    },
    []string{"success", "authentication_method"}, // Much more manageable
)
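
For the aggregation point above, one approach is to collapse a per-entity dimension into a count before exporting, so the label set stays bounded. A minimal sketch, assuming a hypothetical in-memory map of active users per region maintained elsewhere in the exporter:

go
// activeUsersByRegion is a hypothetical structure populated elsewhere in the exporter.
var activeUsersByRegion map[string]map[string]struct{} // region -> set of user IDs

var activeUsers = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "app_active_users",
        Help: "Number of distinct active users, aggregated per region",
    },
    []string{"region"}, // bounded label instead of a per-user label
)

func updateActiveUsers() {
    for region, users := range activeUsersByRegion {
        activeUsers.WithLabelValues(region).Set(float64(len(users)))
    }
}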

Reliability and Availability

Error Handling

Exporters should be robust against errors:

  • Handle connection failures gracefully
  • Log errors without crashing
  • Return partial metrics when possible
  • Set default or fallback values when metrics collection fails
go
func collectDatabaseMetrics() {
    var dbConn *sql.DB
    var err error

    // Try to connect with timeout
    connectChan := make(chan bool, 1) // buffered so the goroutine can finish even if we time out
    go func() {
        dbConn, err = sql.Open("mysql", "user:password@/dbname")
        if err == nil {
            // sql.Open is lazy; Ping forces a real connection attempt
            err = dbConn.Ping()
        }
        connectChan <- true
    }()

    select {
    case <-connectChan:
        if err != nil {
            log.Printf("Failed to connect to database: %v", err)
            databaseUp.Set(0) // Indicate database is down
            return
        }
    case <-time.After(5 * time.Second):
        log.Printf("Database connection timeout")
        databaseUp.Set(0)
        return
    }

    defer dbConn.Close()
    databaseUp.Set(1) // Indicate database is up

    // Continue collecting metrics...
}

Timeouts

Implement timeouts to prevent hanging:

  • Set connection timeouts
  • Add request timeouts
  • Use context for cancellation
go
func collectMetricsWithTimeout(ctx context.Context) ([]byte, error) {
    // Buffered channels so the worker goroutine doesn't block forever if we time out
    resultChan := make(chan []byte, 1)
    errChan := make(chan error, 1)

    go func() {
        metrics, err := collectMetrics()
        if err != nil {
            errChan <- err
            return
        }
        resultChan <- metrics
    }()

    select {
    case result := <-resultChan:
        return result, nil
    case err := <-errChan:
        return nil, err
    case <-ctx.Done():
        return nil, fmt.Errorf("metrics collection timed out")
    }
}

http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
    defer cancel()

    metrics, err := collectMetricsWithTimeout(ctx)
    if err != nil {
        http.Error(w, fmt.Sprintf("Metrics collection failed: %v", err), http.StatusInternalServerError)
        return
    }

    w.Write(metrics)
})

Operational Considerations

Self-Monitoring

Exporters should expose their own operational metrics:

  • Process metrics (memory, CPU usage) for the exporter itself (see the collectors note after the example below)
  • Collection duration and errors
  • Scrape counts and durations
go
var (
    scrapeDuration = prometheus.NewSummary(prometheus.SummaryOpts{
        Name: "exporter_scrape_duration_seconds",
        Help: "Duration of metrics collection",
    })

    scrapeErrors = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "exporter_scrape_errors_total",
        Help: "Total number of errors during metrics collection",
    })

    lastScrapeTime = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "exporter_last_scrape_timestamp_seconds",
        Help: "Unix timestamp of the last successful scrape",
    })
)

func init() {
    // Register self-monitoring metrics
    prometheus.MustRegister(scrapeDuration)
    prometheus.MustRegister(scrapeErrors)
    prometheus.MustRegister(lastScrapeTime)
}

func collectAndExposeMetrics(w http.ResponseWriter, r *http.Request) {
    start := time.Now()

    err := collectMetrics()
    if err != nil {
        scrapeErrors.Inc()
        http.Error(w, fmt.Sprintf("Error collecting metrics: %v", err), http.StatusInternalServerError)
        return
    }

    duration := time.Since(start).Seconds()
    scrapeDuration.Observe(duration)
    lastScrapeTime.Set(float64(time.Now().Unix()))

    promhttp.Handler().ServeHTTP(w, r)
}
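
For process and runtime metrics, note that client_golang's default registry already registers Go runtime and process collectors. If you use a custom registry (often preferable, since you control exactly what is exposed), you can add them explicitly. A sketch, assuming a recent client_golang version with the collectors subpackage:

go
import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/collectors"
)

func newRegistry() *prometheus.Registry {
    reg := prometheus.NewRegistry()
    // Expose CPU time, resident memory, open file descriptors, goroutine
    // counts, GC pauses, etc. for the exporter process itself.
    reg.MustRegister(
        collectors.NewProcessCollector(collectors.ProcessCollectorOpts{}),
        collectors.NewGoCollector(),
    )
    return reg
}

// Then serve only what this registry contains:
// http.Handle("/metrics", promhttp.HandlerFor(newRegistry(), promhttp.HandlerOpts{}))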

Configuration

Design exporters to be easily configurable:

  • Support configuration via flags, environment variables, and configuration files
  • Provide reasonable defaults
  • Document all configuration options
  • Allow runtime reconfiguration when possible (see the reload sketch after the example below)
go
// Example of flag-based configuration
package main

import (
    "flag"
    "fmt"
    "net/http"
    "os"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    listenAddress   = flag.String("web.listen-address", ":9100", "Address to listen on for web interface and telemetry")
    metricsPath     = flag.String("web.telemetry-path", "/metrics", "Path under which to expose metrics")
    collectInterval = flag.Duration("collector.interval", 15*time.Second, "Interval for cached metric collection")
    timeoutDuration = flag.Duration("collector.timeout", 5*time.Second, "Timeout for metrics collection")
    logLevel        = flag.String("log.level", "info", "Log level: debug, info, warn, error")
)

func main() {
    flag.Parse()

    // Use environment variables as overrides
    if envAddress := os.Getenv("EXPORTER_WEB_LISTEN_ADDRESS"); envAddress != "" {
        *listenAddress = envAddress
    }

    // ... set up metrics and collectors ...

    http.Handle(*metricsPath, promhttp.Handler())
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte(`<html>
<head><title>Example Exporter</title></head>
<body>
<h1>Example Exporter</h1>
<p><a href="` + *metricsPath + `">Metrics</a></p>
</body>
</html>`))
    })

    fmt.Printf("Starting exporter on %s\n", *listenAddress)
    if err := http.ListenAndServe(*listenAddress, nil); err != nil {
        fmt.Printf("Error starting exporter: %s\n", err)
        os.Exit(1)
    }
}
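
For the runtime reconfiguration point, a common pattern is reloading a configuration file when the process receives SIGHUP. The sketch below assumes hypothetical loadConfig and applyConfig helpers; it is one way to do it, not the only one:

go
// Reload configuration on SIGHUP. loadConfig and applyConfig are
// placeholders for your own configuration handling.
func watchForReload(configPath string) {
    hup := make(chan os.Signal, 1)
    signal.Notify(hup, syscall.SIGHUP)
    go func() {
        for range hup {
            cfg, err := loadConfig(configPath)
            if err != nil {
                log.Printf("Config reload failed, keeping previous config: %v", err)
                continue
            }
            applyConfig(cfg)
            log.Printf("Configuration reloaded from %s", configPath)
        }
    }()
}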

Documentation

Metric Documentation

Provide clear documentation for all metrics:

  • Include descriptive help strings for all metrics
  • Document units in both the metric name and help text
  • Explain what each metric represents and how to interpret it
  • Provide example Prometheus queries for common use cases
go
// Good metric documentation
prometheus.NewCounter(prometheus.CounterOpts{
    Name: "http_requests_total",
    Help: "Total number of HTTP requests processed, regardless of outcome",
})

prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "node_memory_usage_bytes",
    Help: "Current memory usage in bytes, including buffers/cache",
})

prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "HTTP request latency distribution in seconds. Includes time spent waiting for response from backend.",
    Buckets: prometheus.ExponentialBuckets(0.001, 2, 10),
})

README

Create a comprehensive README for your exporter:

  • Describe the purpose of the exporter
  • List all exposed metrics with explanations
  • Document all configuration options
  • Include installation and usage instructions
  • Provide example Prometheus configuration
  • Show example dashboards or alerts

Security Considerations

Authentication and Authorization

Consider security when exposing metrics:

  • Add basic authentication for sensitive metrics
  • Use TLS for encrypting metrics traffic (see the TLS sketch after the example below)
  • Implement IP-based access control when necessary
go
// Simple basic auth example (uses crypto/subtle for constant-time comparison)
func basicAuthMiddleware(handler http.Handler, username, password string) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        user, pass, ok := r.BasicAuth()

        // Constant-time comparison reduces timing side channels
        userOK := subtle.ConstantTimeCompare([]byte(user), []byte(username)) == 1
        passOK := subtle.ConstantTimeCompare([]byte(pass), []byte(password)) == 1

        if !ok || !userOK || !passOK {
            w.Header().Set("WWW-Authenticate", `Basic realm="Restricted"`)
            http.Error(w, "Unauthorized", http.StatusUnauthorized)
            return
        }

        handler.ServeHTTP(w, r)
    })
}

func main() {
    // ... set up metrics and collectors ...

    metricsHandler := promhttp.Handler()

    // Apply basic auth if credentials are configured
    if *authUsername != "" && *authPassword != "" {
        metricsHandler = basicAuthMiddleware(metricsHandler, *authUsername, *authPassword)
    }

    http.Handle("/metrics", metricsHandler)
    // ... start server ...
}
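
For the TLS point above, the simplest option is to serve the metrics endpoint over HTTPS with a certificate and key you already have on disk (the file paths below are placeholders). Prometheus's exporter-toolkit also provides a standard web configuration file for TLS and basic auth, which is worth considering for production exporters.

go
func main() {
    // ... set up metrics and handlers as above ...
    http.Handle("/metrics", promhttp.Handler())

    // cert.pem and key.pem are placeholder paths to your certificate and key.
    if err := http.ListenAndServeTLS(*listenAddress, "cert.pem", "key.pem", nil); err != nil {
        log.Fatalf("Error starting HTTPS server: %v", err)
    }
}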

Sensitive Information

Be careful with sensitive data:

  • Never expose credentials, tokens, or personal data in metrics
  • Avoid including sensitive information in labels
  • Consider hashing or obfuscating potentially sensitive identifiers (see the hashing sketch after the example below)
go
// Bad - exposing sensitive data
userActivity := prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "user_activity_total",
        Help: "User activity by user",
    },
    []string{"user_email", "credit_card_last_four"},
)

// Better - anonymized approach
userActivity := prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "user_activity_total",
        Help: "User activity by user",
    },
    []string{"user_id_hash", "payment_method_type"},
)
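
If you genuinely need a per-user dimension, the hashing approach mentioned above replaces the raw identifier with an obfuscated token. Note that this does not reduce cardinality, so it should still be used sparingly. A sketch, where userID is a hypothetical variable and the 12-character truncation is an arbitrary assumption:

go
import (
    "crypto/sha256"
    "encoding/hex"
)

// hashLabel returns an obfuscated token to use as a label value in place
// of a raw identifier such as an email address or user ID.
func hashLabel(id string) string {
    sum := sha256.Sum256([]byte(id))
    return hex.EncodeToString(sum[:])[:12] // truncated for readability
}

// Usage (userID is hypothetical):
// userActivity.WithLabelValues(hashLabel(userID), "credit_card").Inc()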

Real-World Examples

Let's look at a complete exporter example that follows best practices:

go
package main

import (
    "context"
    "flag"
    "fmt"
    "log"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Exporter configuration
var (
    listenAddress  = flag.String("web.listen-address", ":9100", "Address to listen on for web interface and telemetry")
    metricsPath    = flag.String("web.telemetry-path", "/metrics", "Path under which to expose metrics")
    collectTimeout = flag.Duration("collector.timeout", 5*time.Second, "Timeout for metrics collection")
)

// Exporter metrics
var (
    // Application metrics
    requestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "app_requests_total",
            Help: "Total number of requests by status code and method",
        },
        []string{"status_code", "method"},
    )

    requestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "app_request_duration_seconds",
            Help:    "Request duration distribution in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )

    activeConnections = prometheus.NewGauge(
        prometheus.GaugeOpts{
            Name: "app_active_connections",
            Help: "Current number of active connections",
        },
    )

    // Self-monitoring metrics
    scrapeDuration = prometheus.NewSummary(
        prometheus.SummaryOpts{
            Name: "exporter_scrape_duration_seconds",
            Help: "Duration of metrics collection",
        },
    )

    scrapeErrors = prometheus.NewCounter(
        prometheus.CounterOpts{
            Name: "exporter_scrape_errors_total",
            Help: "Total number of errors during metrics collection",
        },
    )
)

func init() {
    // Register application metrics
    prometheus.MustRegister(requestsTotal)
    prometheus.MustRegister(requestDuration)
    prometheus.MustRegister(activeConnections)

    // Register self-monitoring metrics
    prometheus.MustRegister(scrapeDuration)
    prometheus.MustRegister(scrapeErrors)
}

// collectMetrics gathers all metrics from the application
func collectMetrics(ctx context.Context) error {
    start := time.Now()
    defer func() {
        scrapeDuration.Observe(time.Since(start).Seconds())
    }()

    // Simulate collecting metrics from an application
    var collectErr error
    errChan := make(chan error, 1)

    go func() {
        // Connect to the application and collect metrics
        time.Sleep(100 * time.Millisecond) // Simulate work

        // Record active connections
        activeConnections.Set(42)

        // Record request counts
        requestsTotal.WithLabelValues("200", "GET").Add(30)
        requestsTotal.WithLabelValues("404", "GET").Add(5)
        requestsTotal.WithLabelValues("500", "POST").Add(2)

        // Record request durations
        requestDuration.WithLabelValues("GET", "/api/users").Observe(0.05)
        requestDuration.WithLabelValues("POST", "/api/users").Observe(0.3)

        errChan <- nil
    }()

    select {
    case collectErr = <-errChan:
        // Collection completed
    case <-ctx.Done():
        collectErr = fmt.Errorf("metrics collection timed out")
    }

    if collectErr != nil {
        scrapeErrors.Inc()
    }

    return collectErr
}

func main() {
    flag.Parse()

    // Create a context that will be canceled on SIGTERM/SIGINT
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    // Handle graceful shutdown
    go func() {
        sigChan := make(chan os.Signal, 1)
        signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT)
        <-sigChan
        log.Println("Received shutdown signal, exiting...")
        cancel()
    }()

    // Set up HTTP server
    mux := http.NewServeMux()

    // Landing page
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        if r.URL.Path != "/" {
            http.NotFound(w, r)
            return
        }
        w.Write([]byte(`<html>
<head><title>Example Exporter</title></head>
<body>
<h1>Example Exporter</h1>
<p><a href="` + *metricsPath + `">Metrics</a></p>
</body>
</html>`))
    })

    // Metrics endpoint with collection
    mux.HandleFunc(*metricsPath, func(w http.ResponseWriter, r *http.Request) {
        // Create a new context with timeout for metrics collection
        timeoutCtx, cancel := context.WithTimeout(ctx, *collectTimeout)
        defer cancel()

        // Collect metrics
        err := collectMetrics(timeoutCtx)
        if err != nil {
            log.Printf("Error collecting metrics: %v", err)
            // Continue with partial metrics
        }

        // Expose all registered metrics
        promhttp.Handler().ServeHTTP(w, r)
    })

    // Start HTTP server
    server := &http.Server{
        Addr:    *listenAddress,
        Handler: mux,
    }

    // Start server in a goroutine
    go func() {
        log.Printf("Starting exporter on %s", *listenAddress)
        if err := server.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatalf("Error starting HTTP server: %v", err)
        }
    }()

    // Wait for context cancellation (shutdown signal)
    <-ctx.Done()

    // Shut down HTTP server with a timeout
    shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer shutdownCancel()

    if err := server.Shutdown(shutdownCtx); err != nil {
        log.Fatalf("Error shutting down HTTP server: %v", err)
    }

    log.Println("Exporter stopped")
}

Example Prometheus Configuration

To scrape the exporter we just built, the Prometheus configuration would look like this:

yaml
scrape_configs:
  - job_name: 'example_exporter'
    scrape_interval: 15s
    scrape_timeout: 10s
    static_configs:
      - targets: ['localhost:9100']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'example-app'

Visualization

Here's how you might visualize these metrics in Grafana. A dashboard for this exporter would use PromQL queries like:

  1. Request Rate:

    sum(rate(app_requests_total[5m])) by (method)
  2. Error Rate:

    sum(rate(app_requests_total{status_code=~"5.."}[5m])) / sum(rate(app_requests_total[5m]))
  3. Response Time (90th percentile):

    histogram_quantile(0.9, sum(rate(app_request_duration_seconds_bucket[5m])) by (le, endpoint))
  4. Active Connections:

    app_active_connections

Summary

Developing Prometheus exporters following best practices ensures your monitoring system is robust, scalable, and provides valuable insights. Key takeaways include:

  1. Consistent naming - Use clear, consistent naming conventions for metrics and labels
  2. Appropriate metric types - Choose counter, gauge, histogram, or summary based on what you're measuring
  3. Performance optimization - Minimize resource usage and cardinality
  4. Reliability - Handle errors gracefully and implement timeouts
  5. Self-monitoring - Expose metrics about your exporter's operation
  6. Documentation - Provide clear documentation for all metrics
  7. Security - Protect sensitive data and implement authentication when needed

By following these best practices, you'll create exporters that seamlessly integrate with your Prometheus monitoring ecosystem and provide valuable insights into your systems.

Exercises

  1. Create a simple exporter that exposes system metrics (CPU, memory, disk) following the best practices outlined in this guide.

  2. Review an existing exporter and identify areas where it could be improved based on these best practices.

  3. Design a set of metrics for a web application that would provide comprehensive monitoring while avoiding high cardinality issues.

  4. Implement self-monitoring for an existing exporter to track its performance and reliability.

  5. Create a Grafana dashboard that effectively visualizes the metrics from your exporter, with useful panels and alerts.


