Creating Alerts from Metrics
Introduction
Alerts are a crucial part of any monitoring system. They notify you when something goes wrong, allowing you to respond quickly to issues before they impact your users. In Grafana Loki, you can create alerts based on metrics derived from your logs using LogQL, enabling proactive monitoring of your applications.
This guide will walk you through the process of creating alerts from LogQL metrics, explaining the concepts step by step and providing practical examples.
Prerequisites
Before diving into alerts, make sure you:
- Have Grafana and Loki set up and configured
- Understand basic LogQL queries and metrics
- Have logs flowing into your Loki instance
Understanding Alerting Concepts
Alerts in Grafana Loki follow a simple but powerful workflow:
- Query: You define a LogQL metrics query that calculates values from your logs
- Condition: You set conditions that determine when an alert should fire
- Notification: You configure how and whom to notify when the alert triggers
- Resolution: The alert resolves when the condition is no longer met
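In practice, the first two stages are often combined into a single expression: a LogQL metrics query with a threshold appended, which only returns a result while the condition holds. A minimal sketch (the app="web-app" selector, the "error" filter, and the threshold of 10 are placeholders for your own setup):
# fires while the app logs more than 10 error lines per second, averaged over 5 minutes
sum(rate({app="web-app"} |= "error" [5m])) > 10
The notification and resolution stages are then handled by Grafana's alerting configuration rather than by the query itself.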
Creating Your First Alert from LogQL Metrics
Let's walk through creating a basic alert that will notify you when HTTP error rates exceed a threshold.
Step 1: Define Your LogQL Metrics Query
First, we need a LogQL query that calculates error rates from our logs:
sum(rate({app="web-app"} |~ "status=5[0-9][0-9]" [5m]))
/
sum(rate({app="web-app"} [5m])) * 100
This query:
- Calculates the per-second rate of log lines containing a 5xx status code (the |~ regex filter matches status=500 through status=599)
- Divides it by the per-second rate of all log lines from the application
- Multiplies the result by 100 to express it as a percentage
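If your application emits structured logs, it is often more robust to parse the status code into a label and filter on its value rather than matching raw text. A sketch that assumes logfmt-formatted lines with a status field (adjust the parser and field name to your log format):
# assumes lines such as: level=info status=502 path=/checkout duration=183ms
sum(rate({app="web-app"} | logfmt | status >= 500 [5m]))
/
sum(rate({app="web-app"} [5m])) * 100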
Step 2: Create a New Alert Rule
In Grafana:
- Navigate to "Alerting" in the left sidebar
- Click "New alert rule"
- Select "Loki" as your data source
- Paste your LogQL metrics query
- Set the evaluation interval (how often Grafana checks your condition)
Step 3: Define Alert Conditions
Now, set the conditions that determine when your alert should fire:
WHEN last() OF query(A, 5m, now) > 5
This condition triggers when the error rate exceeds 5% for the most recent data point.
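The WHEN ... syntax above reflects Grafana's classic condition editor. If you manage rules through Loki's ruler instead, or simply prefer to keep the threshold in the query, the same condition can be expressed directly in LogQL; the rule then fires whenever the expression returns a result:
# the 5% threshold expressed directly in the query
(
  sum(rate({app="web-app"} |~ "status=5[0-9][0-9]" [5m]))
  /
  sum(rate({app="web-app"} [5m]))
) * 100 > 5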
Step 4: Configure Alert Details
Provide the following details for your alert:
- Rule name: "High HTTP Error Rate"
- Description: "Alert when HTTP 5xx errors exceed 5% of total requests"
- Summary: "High error rate detected for web-app"
- Severity: "warning" or "critical" depending on your needs
Step 5: Set Up Notifications
Configure notification channels such as:
- Slack
- PagerDuty
- WebHooks
- Other integrated notification systems
For example, a Slack notification might look like this:
⚠️ [WARNING] High HTTP Error Rate
web-app is experiencing a 7.2% error rate (threshold: 5%)
Time: 2023-06-15 14:32:21
Advanced Alerting Techniques
Once you're comfortable with basic alerts, you can explore more advanced techniques.
Multi-Condition Alerts
You can create more sophisticated alerts by combining multiple conditions:
sum(rate({app="web-app", environment="production"} |~ "status=5[0-9][0-9]" [5m])) > 100
and
sum(rate({app="web-app", environment="production"} |~ "status=5[0-9][0-9]" [5m]))
/
sum(rate({app="web-app", environment="production"} [5m])) * 100 > 5
This alert triggers only when both conditions hold: the error rate exceeds 100 errors per second AND errors make up more than 5% of all log lines.
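The same pattern extends to per-service alerting: group both sides by the same label so that the and operator matches the corresponding series. A sketch assuming a service label on your streams (swap in whatever label identifies your services):
# fires per service, and only when that service has both high error volume and a high error rate
sum by (service) (rate({app="web-app"} |~ "status=5[0-9][0-9]" [5m])) > 10
and
sum by (service) (rate({app="web-app"} |~ "status=5[0-9][0-9]" [5m]))
/
sum by (service) (rate({app="web-app"} [5m])) * 100 > 5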
Alerting with Log Patterns
You can use LogQL's parsing capabilities to alert on specific patterns in your logs:
sum(rate({app="payment-service"} | json | status=~"failed|rejected" [5m])) > 10
This would alert when payment failures exceed 10 per minute.
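If the logs are unstructured plain text rather than JSON, LogQL's pattern parser can extract fields without full parsing. A hedged sketch, assuming lines shaped like "payment 1234 rejected: insufficient funds" (the line format, and therefore the pattern expression, is an assumption):
# extracts a status field from plain-text lines such as: payment 1234 rejected: insufficient funds
sum(count_over_time(
  {app="payment-service"}
    | pattern "payment <_> <status>: <_>"
    | status=~"failed|rejected"
  [1m]
)) > 10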
Implementing Alert Deduplication
For noisy logs or frequently triggering alerts, implement deduplication:
sum by(endpoint) (
rate({app="api-gateway"} | json | status >= 500 [5m])
) > 5
This groups errors by endpoint, so you receive distinct alerts for each problematic endpoint.
Practical Example: Complete Production Alert
Let's create a comprehensive alert for monitoring API response times:
# Query A: calculate the 95th percentile response time per service
quantile_over_time(0.95,
  {app="api-gateway"}
    | json
    | unwrap response_time_ms [5m]
) by (service)
# Alert condition
WHEN last() OF query(A, 5m, now) > 1000
FOR 15m
This alert will fire when the 95th percentile response time for any service exceeds 1000ms for 15 consecutive minutes.
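As before, if this rule lives in Loki's ruler rather than Grafana's classic conditions, the threshold can be folded into the expression, with the 15-minute duration set through the rule's for field instead of in the query:
# per-service P95 latency with the 1000 ms threshold in the query
quantile_over_time(0.95,
  {app="api-gateway"}
    | json
    | unwrap response_time_ms [5m]
) by (service)
> 1000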
In the notification, you might include:
🚨 High API Latency Detected
Service: {{$labels.service}}
Current P95 latency: {{$value}}ms
Threshold: 1000ms
Duration: 15+ minutes
Please investigate immediately as users are experiencing slow responses.
Best Practices for LogQL Metrics Alerts
When creating alerts from LogQL metrics, follow these best practices:
- Avoid alert fatigue: Set appropriate thresholds that balance sensitivity with importance
- Use rate functions: For most metrics, use
rate()
orcount_over_time()
rather than raw counts - Add context: Include enough information in notifications to understand the issue quickly
- Consider trends: Alert on trends rather than single data points where appropriate (a sketch follows this list)
- Test thoroughly: Always test your alerts before deploying them to production
- Document alerts: Keep a record of all alerts and their intended purpose
- Implement escalation paths: Define different severity levels and appropriate response procedures
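For the trend-based alerting mentioned above, one option is to compare the current window against an earlier one using LogQL's offset modifier (available in recent Loki versions). A sketch; the one-hour windows and the 3x factor are arbitrary placeholders:
# fires when error volume in the last hour is more than 3x that of the previous hour
sum(count_over_time({app="web-app"} |= "error" [1h]))
> 3 *
sum(count_over_time({app="web-app"} |= "error" [1h] offset 1h))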
Troubleshooting Alerts
Common issues with LogQL metrics alerts include:
Too Many Alerts (Alert Storm)
If you're receiving too many alerts:
- Increase thresholds
- Add aggregation to group similar problems (a sketch follows this list)
- Implement alert grouping in your notification system
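For the aggregation point, grouping by a coarser label collapses many per-instance series into a single series per group, so one noisy deployment produces one alert instead of dozens. A sketch assuming Kubernetes-style namespace and pod stream labels:
# one series (and therefore one alert) per namespace instead of one per pod
sum by (namespace) (rate({app="web-app"} |= "error" [5m])) > 10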
No Alerts When Expected
If alerts aren't firing when they should:
- Verify your LogQL query returns data (test it in the Explore view; a debugging sketch follows this list)
- Check if evaluation intervals are too long
- Ensure notification channels are correctly configured
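When verifying a query in Explore, a simple approach is to strip it back to the bare stream selector and re-add one stage at a time; whichever stage makes the results disappear is usually the culprit. A sketch using the error-rate query from earlier (run each query separately):
# 1. does the stream selector match any logs at all?
{app="web-app"}
# 2. does the line filter still match anything?
{app="web-app"} |~ "status=5[0-9][0-9]"
# 3. does the metric query return samples?
sum(rate({app="web-app"} |~ "status=5[0-9][0-9]" [5m]))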
Summary
Creating alerts from LogQL metrics allows you to proactively monitor your applications by:
- Defining meaningful metrics using LogQL
- Setting appropriate thresholds and conditions
- Configuring effective notifications
- Following best practices to avoid alert fatigue
This approach transforms Loki from a log storage system into a powerful monitoring solution that helps maintain system reliability and quickly respond to issues.
Exercise: Create Your Own Alerts
To reinforce your learning, try creating these alerts:
- Alert when log volume drops significantly, which can indicate a log pipeline issue (a starting sketch follows this list)
- Alert on increased error rates for a specific customer or tenant
- Alert when a particular log message appears that requires immediate attention
- Create a multi-condition alert that considers both error rates and system metrics
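As a starting point for the first exercise, here is a hedged sketch that compares current log volume with the previous window; the selector, the 30-minute windows, and the 50% factor are placeholders to adapt, and it relies on the offset modifier:
# fires when log volume over the last 30 minutes is less than half that of the previous 30 minutes
sum(count_over_time({app="web-app"} [30m]))
< 0.5 *
sum(count_over_time({app="web-app"} [30m] offset 30m))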