Common Metric Patterns
Introduction
LogQL, Loki's query language, provides powerful capabilities not just for retrieving logs but also for extracting metrics from your log data. This ability to transform logs into metrics bridges the gap between logging and monitoring systems, enabling you to derive quantitative insights from qualitative log data.
In this guide, we'll explore common metric patterns in LogQL that help you extract valuable metrics from your logs. These patterns are essential for effective monitoring, alerting, and visualization in Grafana Loki.
Understanding LogQL Metric Queries
Before diving into specific patterns, let's understand the basic structure of a LogQL metric query:
logQL_stream_selector | logQL_pipeline_stages | metric_extraction_function(parameter)
A metric query in LogQL consists of three main components:
- Stream selector: Selects the log streams to query (e.g.,
{app="frontend"}
) - Pipeline stages: Optional transformations to prepare log lines (e.g.,
| json | line_format "{{.message}}"
) - Metric extraction: Functions that convert logs into numeric values (e.g.,
| rate()
)
Now, let's explore common metric patterns you'll use regularly.
Common Metric Patterns
1. Rate of Log Lines
The most basic metric pattern is measuring the rate of log occurrences. This is useful for tracking error rates, activity levels, or any event frequency.
Example: Error Rate Monitoring
{app="payment-service"} |= "error" | rate(1m)
This query:
- Selects logs from the payment service
- Filters for lines containing "error"
- Calculates the rate of these errors per second over a 1-minute window
Output Visualization
When visualized in Grafana, this produces a time series showing error frequency, allowing you to spot spikes or abnormal patterns.
2. Counting Specific Events
The count_over_time
function lets you count occurrences within specific time windows.
Example: Login Attempts
{app="auth-service"} |= "login attempt" | count_over_time(5m)
This counts login attempts in 5-minute buckets, useful for authentication monitoring.
Difference from Rate
While rate()
gives per-second values, count_over_time()
provides absolute counts within each time window.
3. Extracting and Aggregating Numeric Values
A powerful pattern is extracting numeric values from logs and aggregating them.
Example: Average Response Time
{app="web-server"}
| regexp `.*response_time=(?P<response_time>[0-9]+).*`
| unwrap response_time
| avg_over_time(1m)
This query:
- Selects web server logs
- Extracts response time using a regular expression
- Unwraps the extracted value
- Calculates the average over 1-minute windows
4. Calculating Percentiles
For performance monitoring, percentiles are often more useful than averages.
Example: 95th Percentile Response Time
{app="api-gateway"}
| json
| duration > 0
| unwrap duration
| quantile_over_time(0.95, 5m)
This query extracts the 95th percentile of API response durations over 5-minute windows.
5. Creating Histograms
Histograms provide distribution visualization of your metrics.
Example: Response Time Distribution
{app="web-server"}
| regexp `.*response_time=(?P<response_time>[0-9]+).*`
| unwrap response_time
| histogram_quantile(0.5, sum by(le) (rate(${__range})))
This creates a histogram of response times and calculates the median (50th percentile).
6. Group-Based Metrics
Grouping metrics by labels allows for comparative analysis.
Example: Error Rates by Service
{environment="production"} |= "ERROR"
| label_format service_name="{{ service }}"
| rate(5m)
| sum by(service_name)
This query:
- Selects production error logs
- Formats a service name label
- Calculates error rates
- Sums rates by service
7. Absent Metrics (Detecting Missing Logs)
Sometimes the absence of logs is as important as their presence.
Example: Detecting Service Silence
absent(count_over_time({app="critical-service"} [10m]))
This returns a value of 1 when the critical service has no logs for 10 minutes, useful for detecting silent failures.
Applying Metrics in Real-World Scenarios
Let's explore how these patterns apply to real-world monitoring scenarios.
Application Performance Monitoring
{app="ecommerce"}
| json
| unwrap_duration(request_time)
| sum by (endpoint) (rate(1m))
This tracks request rates across different endpoints in an e-commerce application.
Error Budgeting
1 - (
sum(rate({app="payment-gateway"} |= "success" [1h]))
/
sum(rate({app="payment-gateway"} [1h]))
)
This calculates the error rate as a proportion of total requests, useful for SLO monitoring.
Security Monitoring
{app="auth-service"}
|= "failed login"
| json
| label_format user="{{ username }}"
| count_over_time(5m)
| sum by(user)
This counts failed login attempts per user, helping identify potential brute force attacks.
Visualizing Metrics in Grafana
These LogQL metric patterns can be visualized in Grafana using:
- Time series panels: For rate, count, and numeric metrics over time
- Gauge panels: For current values against thresholds
- Bar charts: For comparing metrics across different services/components
- Heatmaps: For visualizing histogram data
Common Pitfalls and Optimization
Potential Issues
- High cardinality: Be cautious with high-cardinality labels in group operations
- Resource consumption: Complex metric queries can be resource-intensive
- Time window selection: Too small windows cause noise, too large lose detail
Optimizations
# Instead of this (processes all logs first)
{app="busy-service"} | json | status=~"5.." | rate(1m)
# Do this (filters first)
{app="busy-service"} |= "status\":5" | json | rate(1m)
Always filter logs as early as possible in your query to improve performance.
Combining Metric Patterns
You can combine patterns for more sophisticated monitoring:
Summary
LogQL metric patterns transform your logs into valuable metrics, enabling:
- Rate-based monitoring for events and errors
- Extraction and analysis of numeric values
- Distribution analysis through percentiles and histograms
- Comparative analysis through grouping
- Detection of missing logs
By mastering these patterns, you can build comprehensive monitoring solutions that leverage your log data beyond traditional log analysis.
Exercises
- Create a query to monitor the 99th percentile response time for a web application
- Develop a query that tracks error rates by HTTP status code
- Build a query to detect when a critical service hasn't logged any activity for 5 minutes
- Create a dashboard showing the top 5 users by login frequency
Further Resources
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)