Aggregation Operators
Introduction
When working with logs in Grafana Loki, you'll often need to transform your raw log data into meaningful metrics for visualization and alerting. This is where aggregation operators in LogQL come into play. These powerful functions allow you to combine, group, and summarize log data in various ways to extract valuable insights.
Aggregation operators are essential components of LogQL's metrics capabilities that help you:
- Summarize large volumes of log data into concise metrics
- Calculate statistical measures across log entries
- Group related data for better analysis
- Transform raw log data into time series for visualization
In this guide, we'll explore the various aggregation operators available in LogQL, their syntax, and how to use them effectively in real-world scenarios.
Understanding Aggregation Operators
Aggregation operators in LogQL take the extracted values from your logs and perform calculations to produce summary metrics. These operators are applied to log streams that have been filtered and from which values have been extracted.
Basic Syntax
The general syntax for using aggregation operators in LogQL is:
<aggregation_operator>([parameter,] <expression>) [by|without (<label list>)]
Where:
<aggregation_operator>
is the function you want to apply (sum, avg, min, etc.)<expression>
is the metric expression to aggregate[by|without (<label list>)]
is an optional clause to control grouping
Common Aggregation Operators
Let's explore the most commonly used aggregation operators in LogQL:
Sum
The sum
operator adds up all values across the specified range.
sum(rate({app="frontend"}[5m]))
This query calculates the sum of rates across all log streams from the frontend app over 5-minute intervals.
Average (avg)
The avg
operator calculates the average value across the specified range.
avg(rate({app="api"} |= "error" [5m]))
This query returns the average error rate across all API log streams over 5-minute intervals.
Min and Max
The min
and max
operators return the minimum and maximum values respectively.
max(bytes_rate({app="fileserver"}[10m]))
This query returns the maximum byte transfer rate for the fileserver app over 10-minute intervals.
Count
The count
operator counts the number of elements or entries.
count(count_over_time({app="auth"} |= "login" [1h]))
This query counts the total number of login events across all authentication service logs over the past hour.
Quantile
The quantile
operator calculates the specified quantile (0-1) across the range.
quantile(0.95, http_request_duration_seconds{handler="/api/users"})
This query calculates the 95th percentile of HTTP request durations for the "/api/users" endpoint.
Grouping with "by" and "without"
Aggregation operators become even more powerful when combined with grouping clauses:
Using "by"
The by
clause groups results by the specified labels.
sum(rate({app="web"}[5m])) by (status_code)
This query groups the sum of rates by status code, allowing you to see the distribution of request rates across different HTTP status codes.
Using "without"
The without
clause excludes specified labels from the grouping.
avg(rate({app="database"}[5m])) without (instance)
This query calculates the average rate across all database instances, removing the instance label from the results.
Practical Examples
Example 1: Error Rate by Service
Let's calculate the error rate by service across our application:
sum(rate({environment="production"} |= "error" [5m])) by (service)
/
sum(rate({environment="production"}[5m])) by (service)
This query:
- Calculates the rate of error logs for each service
- Divides by the total rate of logs for each service
- Returns the error percentage by service
Example 2: 99th Percentile Latency
To monitor performance, we might want to track the 99th percentile latency:
quantile(0.99,
sum(rate({app="api"}
| json | unwrap latency_ms [5m]))
by (endpoint)
)
This query:
- Extracts the
latency_ms
field from JSON logs - Calculates the rate over 5-minute windows
- Groups by endpoint
- Calculates the 99th percentile value
Example 3: Top 5 Most Active Users
Here's how to identify the top 5 most active users in a system:
topk(5,
sum(count_over_time({app="userservice"}
| json | unwrap user_id [1h]))
by (user_id)
)
This query:
- Counts occurrences of each user_id over the past hour
- Groups by user_id
- Returns only the top 5 most frequent users
Advanced Aggregation Operators
LogQL also provides several advanced aggregation operators for more complex analyses:
topk and bottomk
The topk
and bottomk
operators select the top or bottom K elements by value.
topk(3, sum(rate({app="api"}[5m])) by (endpoint))
This returns the 3 endpoints with the highest request rates.
stddev and stdvar
Calculate standard deviation and variance across values.
stddev(rate({app="payment"} | unwrap response_time [5m])) by (endpoint)
This calculates the standard deviation of response times across different payment endpoints.
Visualization with Aggregation Operators
Aggregation operators are particularly useful when creating visualizations in Grafana. Here's a simple example of how you might use them to create a meaningful dashboard panel:
For example, to create a panel showing HTTP error rates by service:
- Create a new panel in Grafana
- Use the following LogQL query:
logql
sum(rate({environment="production"} |= "status=5" [5m])) by (service)
/
sum(rate({environment="production"} |= "status=" [5m])) by (service) - Choose a visualization type (like a bar gauge or time series)
- Configure thresholds and alerts based on acceptable error rates
Summary
Aggregation operators are powerful tools in LogQL that transform raw log data into actionable metrics. They allow you to:
- Calculate rates, sums, averages, and other statistical measures
- Group and categorize data for better analysis
- Identify trends, anomalies, and patterns in your logs
- Create meaningful visualizations for monitoring and alerting
By mastering aggregation operators, you can unlock the full potential of your log data in Grafana Loki, turning what might otherwise be overwhelming volumes of information into clear, actionable insights.
Exercises
- Write a LogQL query to calculate the average response time by service and endpoint.
- Create a query to find the 5 endpoints with the highest error rates.
- Develop a query that compares the request rate today with the same period yesterday.
- Write a query to calculate the 95th percentile of memory usage across all your application instances.
Additional Resources
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)