Creating Alerts from Metrics
Introduction
Alerts are a crucial part of any monitoring system. They notify you when something goes wrong, allowing you to respond quickly to issues before they impact your users. In Grafana Loki, you can create alerts based on metrics derived from your logs using LogQL, enabling proactive monitoring of your applications.
This guide will walk you through the process of creating alerts from LogQL metrics, explaining the concepts step by step and providing practical examples.
Prerequisites
Before diving into alerts, make sure you:
- Have Grafana and Loki set up and configured
- Understand basic LogQL queries and metrics
- Have logs flowing into your Loki instance
Understanding Alerting Concepts
Alerts in Grafana Loki follow a simple but powerful workflow:
- Query: You define a LogQL metrics query that calculates values from your logs
- Condition: You set conditions that determine when an alert should fire
- Notification: You configure how and whom to notify when the alert triggers
- Resolution: The alert resolves when the condition is no longer met
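In practice, the first two stages are often combined into a single expression: a LogQL metrics query with a threshold appended, which only returns a result while the condition holds. A minimal sketch (the app="web-app" selector, the "error" filter, and the threshold of 10 are placeholders for your own setup):
# fires while the app logs more than 10 error lines per second, averaged over 5 minutes
sum(rate({app="web-app"} |= "error" [5m])) > 10
The notification and resolution stages are then handled by Grafana's alerting configuration rather than by the query itself.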
Creating Your First Alert from LogQL Metrics
Let's walk through creating a basic alert that will notify you when HTTP error rates exceed a threshold.
Step 1: Define Your LogQL Metrics Query
First, we need a LogQL query that calculates error rates from our logs:
sum(rate({app="web-app"} |~ "status=5[0-9][0-9]" [5m]))
/
sum(rate({app="web-app"} [5m])) * 100
This query:
- Calculates the per-second rate of log lines containing a 5xx status code (the |~ regex filter matches status=500 through status=599)
- Divides it by the per-second rate of all log lines from the application
- Multiplies the result by 100 to express it as a percentage
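If your application emits structured logs, it is often more robust to parse the status code into a label and filter on its value rather than matching raw text. A sketch that assumes logfmt-formatted lines with a status field (adjust the parser and field name to your log format):
# assumes lines such as: level=info status=502 path=/checkout duration=183ms
sum(rate({app="web-app"} | logfmt | status >= 500 [5m]))
/
sum(rate({app="web-app"} [5m])) * 100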
Step 2: Create a New Alert Rule
In Grafana:
- Navigate to "Alerting" in the left sidebar
- Click "New alert rule"
- Select "Loki" as your data source
- Paste your LogQL metrics query
- Set the evaluation interval (how often Grafana checks your condition)
Step 3: Define Alert Conditions
Now, set the conditions that determine when your alert should fire:
WHEN last() OF query(A, 5m, now) > 5
This condition triggers when the error rate exceeds 5% for the most recent data point.
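The WHEN ... syntax above reflects Grafana's classic condition editor. If you manage rules through Loki's ruler instead, or simply prefer to keep the threshold in the query, the same condition can be expressed directly in LogQL; the rule then fires whenever the expression returns a result:
# the 5% threshold expressed directly in the query
(
  sum(rate({app="web-app"} |~ "status=5[0-9][0-9]" [5m]))
  /
  sum(rate({app="web-app"} [5m]))
) * 100 > 5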
Step 4: Configure Alert Details
Provide the following details for your alert:
- Rule name: "High HTTP Error Rate"
- Description: "Alert when HTTP 5xx errors exceed 5% of total requests"
- Summary: "High error rate detected for web-app"
- Severity: "warning" or "critical" depending on your needs
Step 5: Set Up Notifications
Configure notification channels such as:
- Slack
- PagerDuty
- WebHooks
- Other integrated notification systems
For example, a Slack notification might look like this:
⚠️ [WARNING] High HTTP Error Rate
web-app is experiencing a 7.2% error rate (threshold: 5%)
Time: 2023-06-15 14:32:21
Advanced Alerting Techniques
Once you're comfortable with basic alerts, you can explore more advanced techniques.
Multi-Condition Alerts
You can create more sophisticated alerts by combining multiple conditions:
sum(rate({app="web-app", environment="production"} |~ "status=5[0-9][0-9]" [5m])) > 100
and
sum(rate({app="web-app", environment="production"} |~ "status=5[0-9][0-9]" [5m]))
/
sum(rate({app="web-app", environment="production"} [5m])) * 100 > 5
This alert triggers only when both conditions hold: the error rate exceeds 100 errors per second AND errors make up more than 5% of all log lines.
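The same pattern extends to per-service alerting: group both sides by the same label so that the and operator matches the corresponding series. A sketch assuming a service label on your streams (swap in whatever label identifies your services):
# fires per service, and only when that service has both high error volume and a high error rate
sum by (service) (rate({app="web-app"} |~ "status=5[0-9][0-9]" [5m])) > 10
and
sum by (service) (rate({app="web-app"} |~ "status=5[0-9][0-9]" [5m]))
/
sum by (service) (rate({app="web-app"} [5m])) * 100 > 5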
Alerting with Log Patterns
You can use LogQL's parsing capabilities to alert on specific patterns in your logs:
sum(rate({app="payment-service"} | json | status=~"failed|rejected" [5m])) > 10
This would alert when payment failures exceed 10 per minute.
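If the logs are unstructured plain text rather than JSON, LogQL's pattern parser can extract fields without full parsing. A hedged sketch, assuming lines shaped like "payment 1234 rejected: insufficient funds" (the line format, and therefore the pattern expression, is an assumption):
# extracts a status field from plain-text lines such as: payment 1234 rejected: insufficient funds
sum(count_over_time(
  {app="payment-service"}
    | pattern "payment <_> <status>: <_>"
    | status=~"failed|rejected"
  [1m]
)) > 10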
Implementing Alert Deduplication
For noisy logs or frequently triggering alerts, implement deduplication:
sum by(endpoint) (
rate({app="api-gateway"} | json | status >= 500 [5m])
) > 5
This groups errors by endpoint, so you receive distinct alerts for each problematic endpoint.
Practical Example: Complete Production Alert
Let's create a comprehensive alert for monitoring API response times:
# Query A: calculate the 95th percentile response time per service
quantile_over_time(0.95,
  {app="api-gateway"}
    | json
    | unwrap response_time_ms [5m]
) by (service)
# Alert condition
WHEN last() OF query(A, 5m, now) > 1000
FOR 15m
This alert will fire when the 95th percentile response time for any service exceeds 1000ms for 15 consecutive minutes.
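As before, if this rule lives in Loki's ruler rather than Grafana's classic conditions, the threshold can be folded into the expression, with the 15-minute duration set through the rule's for field instead of in the query:
# per-service P95 latency with the 1000 ms threshold in the query
quantile_over_time(0.95,
  {app="api-gateway"}
    | json
    | unwrap response_time_ms [5m]
) by (service)
> 1000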
In the notification, you might include:
🚨 High API Latency Detected
Service: {{$labels.service}}
Current P95 latency: {{$value}}ms
Threshold: 1000ms
Duration: 15+ minutes
Please investigate immediately as users are experiencing slow responses.
Best Practices for LogQL Metrics Alerts
When creating alerts from LogQL metrics, follow these best practices:
- Avoid alert fatigue: Set appropriate thresholds that balance sensitivity with importance
- Use rate functions: For most metrics, use
rate()
orcount_over_time()
rather than raw counts - Add context: Include enough information in notifications to understand the issue quickly
- Consider trends: Alert on trends rather than single data points where appropriate (a sketch follows this list)
- Test thoroughly: Always test your alerts before deploying them to production
- Document alerts: Keep a record of all alerts and their intended purpose
- Implement escalation paths: Define different severity levels and appropriate response procedures
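For the trend-based alerting mentioned above, one option is to compare the current window against an earlier one using LogQL's offset modifier (available in recent Loki versions). A sketch; the one-hour windows and the 3x factor are arbitrary placeholders:
# fires when error volume in the last hour is more than 3x that of the previous hour
sum(count_over_time({app="web-app"} |= "error" [1h]))
> 3 *
sum(count_over_time({app="web-app"} |= "error" [1h] offset 1h))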
Troubleshooting Alerts
Common issues with LogQL metrics alerts include:
Too Many Alerts (Alert Storm)
If you're receiving too many alerts:
- Increase thresholds
- Add aggregation to group similar problems (a sketch follows this list)
- Implement alert grouping in your notification system
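For the aggregation point, grouping by a coarser label collapses many per-instance series into a single series per group, so one noisy deployment produces one alert instead of dozens. A sketch assuming Kubernetes-style namespace and pod stream labels:
# one series (and therefore one alert) per namespace instead of one per pod
sum by (namespace) (rate({app="web-app"} |= "error" [5m])) > 10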
No Alerts When Expected
If alerts aren't firing when they should:
- Verify your LogQL query returns data (test it in the Explore view; a debugging sketch follows this list)
- Check if evaluation intervals are too long
- Ensure notification channels are correctly configured
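When verifying a query in Explore, a simple approach is to strip it back to the bare stream selector and re-add one stage at a time; whichever stage makes the results disappear is usually the culprit. A sketch using the error-rate query from earlier (run each query separately):
# 1. does the stream selector match any logs at all?
{app="web-app"}
# 2. does the line filter still match anything?
{app="web-app"} |~ "status=5[0-9][0-9]"
# 3. does the metric query return samples?
sum(rate({app="web-app"} |~ "status=5[0-9][0-9]" [5m]))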
Summary
Creating alerts from LogQL metrics allows you to proactively monitor your applications by:
- Defining meaningful metrics using LogQL
- Setting appropriate thresholds and conditions
- Configuring effective notifications
- Following best practices to avoid alert fatigue
This approach transforms Loki from a log storage system into a powerful monitoring solution that helps maintain system reliability and quickly respond to issues.
Exercise: Create Your Own Alerts
To reinforce your learning, try creating these alerts:
- Alert when log volume drops significantly, which can indicate a log pipeline issue (a starting sketch follows this list)
- Alert on increased error rates for a specific customer or tenant
- Alert when a particular log message appears that requires immediate attention
- Create a multi-condition alert that considers both error rates and system metrics
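As a starting point for the first exercise, here is a hedged sketch that compares current log volume with the previous window; the selector, the 30-minute windows, and the 50% factor are placeholders to adapt, and it relies on the offset modifier:
# fires when log volume over the last 30 minutes is less than half that of the previous 30 minutes
sum(count_over_time({app="web-app"} [30m]))
< 0.5 *
sum(count_over_time({app="web-app"} [30m] offset 30m))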