Batch Job Monitoring
Introduction
Batch jobs are programs that run at scheduled intervals without user interaction. Examples include data processing tasks, backup scripts, log rotations, and scheduled cleanups. Unlike long-running services, batch jobs execute for a limited time and then terminate, creating unique monitoring challenges.
In this guide, we'll explore how to effectively monitor batch jobs using Prometheus, addressing common challenges like:
- Detecting job failures
- Measuring execution time
- Tracking success/failure rates
- Alerting when jobs don't run on schedule
- Monitoring resource consumption
Why Batch Job Monitoring is Challenging
Traditional monitoring approaches often focus on long-running services where metrics are continuously available. Batch jobs present several challenges:
- Intermittent execution: Jobs only emit metrics while running
- Ephemeral nature: Once a job completes, its process (and metrics) disappear
- Schedule adherence: You need to detect when a job didn't run at all
- Binary success/failure: You often need to track discrete outcomes (succeeded or failed) rather than continuous states
Prometheus Approach to Batch Job Monitoring
Prometheus offers several patterns for effectively monitoring batch jobs:
1. The Pushgateway Pattern
The Prometheus Pushgateway is a component that allows ephemeral jobs to push their metrics to an intermediate service where Prometheus can scrape them later.
Setting Up the Pushgateway
First, install and run the Pushgateway:
# Using Docker
docker run -d -p 9091:9091 prom/pushgateway
Configure Prometheus to scrape metrics from the Pushgateway:
scrape_configs:
  - job_name: 'pushgateway'
    scrape_interval: 10s
    honor_labels: true  # Important: preserves job labels from pushed metrics
    static_configs:
      - targets: ['localhost:9091']
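Once the Pushgateway is running and Prometheus is scraping it, you can verify the end-to-end path by pushing a test metric. Below is a minimal sketch, assuming the Pushgateway is reachable at localhost:9091 and the prometheus_client Python library is installed; the metric and job names are arbitrary placeholders:
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# A dedicated registry ensures only the metrics defined here are pushed
registry = CollectorRegistry()
test_metric = Gauge('batch_job_smoke_test', 'Test metric to verify the Pushgateway setup', registry=registry)
test_metric.set(1)

# After this call, the metric should be visible at http://localhost:9091/metrics
# and, after the next scrape, in Prometheus itself
push_to_gateway('localhost:9091', job='smoke-test', registry=registry)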
2. Key Metrics for Batch Jobs
Success/Failure Counters
For tracking job outcomes:
from prometheus_client import CollectorRegistry, Counter, push_to_gateway

# Use a dedicated registry so only these metrics are pushed
registry = CollectorRegistry()

# Create counters
success_counter = Counter('batch_job_success_total', 'Number of successful job runs', ['job_name'], registry=registry)
failure_counter = Counter('batch_job_failure_total', 'Number of failed job runs', ['job_name'], registry=registry)

# Example batch job
job_name = 'data_processor'
try:
    # Job logic here
    # ...

    # If successful
    success_counter.labels(job_name=job_name).inc()
    push_to_gateway('localhost:9091', job='batch-jobs', registry=registry)
except Exception:
    # If failed
    failure_counter.labels(job_name=job_name).inc()
    push_to_gateway('localhost:9091', job='batch-jobs', registry=registry)
    raise
Execution Time
Measure how long your jobs take to run:
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
import time

# Use a dedicated registry so only these metrics are pushed
registry = CollectorRegistry()

# Create a gauge for execution time
execution_time = Gauge('batch_job_duration_seconds', 'Duration of batch job execution', ['job_name'], registry=registry)

# Example batch job with timing
job_name = 'nightly_backup'
start_time = time.time()
try:
    # Job logic here
    # ...

    # Record execution time
    duration = time.time() - start_time
    execution_time.labels(job_name=job_name).set(duration)
    push_to_gateway('localhost:9091', job='batch-jobs', registry=registry)
except Exception:
    # Still record execution time on failure
    duration = time.time() - start_time
    execution_time.labels(job_name=job_name).set(duration)
    push_to_gateway('localhost:9091', job='batch-jobs', registry=registry)
    raise
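As a small variation on the example above (a sketch under the same assumptions; run_job() is a hypothetical placeholder for your job logic), the Python client's Gauge.time() helper can measure the duration for you:
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
execution_time = Gauge('batch_job_duration_seconds', 'Duration of batch job execution', ['job_name'], registry=registry)

# Gauge.time() measures the wall-clock time spent inside the block and
# sets the gauge to that duration when the block exits (even on error)
with execution_time.labels(job_name='nightly_backup').time():
    run_job()  # hypothetical placeholder for the actual work

push_to_gateway('localhost:9091', job='batch-jobs', registry=registry)
If the job can fail, wrap the push in a try/finally as in the earlier example so the duration still reaches the Pushgateway.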
Last Execution Timestamp
Track when the job last ran:
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Use a dedicated registry so only these metrics are pushed
registry = CollectorRegistry()

# Create a gauge for the last execution timestamp
last_execution = Gauge('batch_job_last_execution_timestamp', 'Timestamp of last execution', ['job_name'], registry=registry)

job_name = 'hourly_data_sync'

# Set to the current UNIX timestamp
last_execution.labels(job_name=job_name).set_to_current_time()
push_to_gateway('localhost:9091', job='batch-jobs', registry=registry)
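In practice, a single batch job usually registers all of its metrics in one CollectorRegistry and pushes them together, because each push replaces whatever was previously pushed under the same job grouping key. Below is a minimal sketch combining the three metrics above under the same assumptions (Pushgateway at localhost:9091; the job logic is a placeholder):
from prometheus_client import CollectorRegistry, Counter, Gauge, push_to_gateway
import time

registry = CollectorRegistry()
success_counter = Counter('batch_job_success_total', 'Number of successful job runs', ['job_name'], registry=registry)
failure_counter = Counter('batch_job_failure_total', 'Number of failed job runs', ['job_name'], registry=registry)
execution_time = Gauge('batch_job_duration_seconds', 'Duration of batch job execution', ['job_name'], registry=registry)
last_execution = Gauge('batch_job_last_execution_timestamp', 'Timestamp of last execution', ['job_name'], registry=registry)

job_name = 'data_processor'
start_time = time.time()
try:
    # Job logic here
    # ...
    success_counter.labels(job_name=job_name).inc()
except Exception:
    failure_counter.labels(job_name=job_name).inc()
    raise
finally:
    execution_time.labels(job_name=job_name).set(time.time() - start_time)
    last_execution.labels(job_name=job_name).set_to_current_time()
    # One push sends the whole group in a single request
    push_to_gateway('localhost:9091', job='batch-jobs', registry=registry)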
3. Monitoring Batch Jobs with Cron Exporter
For cron-based jobs, you can use a dedicated exporter like prometheus-cron-exporter
or the Node Exporter's textfile collector.
Example using the textfile collector: create a wrapper script that runs your job and writes its metrics to a file that the Node Exporter can read:
#!/bin/bash
JOB_NAME="database_backup"
METRICS_DIR="/var/lib/node_exporter/textfile_collector"
TMP_FILE="${METRICS_DIR}/${JOB_NAME}.prom.tmp"

# Record the start time before the job runs
START_TIME=$(date +%s)
echo "# HELP batch_job_start_time_seconds Timestamp when the job started" > "${TMP_FILE}"
echo "# TYPE batch_job_start_time_seconds gauge" >> "${TMP_FILE}"
echo "batch_job_start_time_seconds{job_name=\"${JOB_NAME}\"} ${START_TIME}" >> "${TMP_FILE}"

# Run the actual job and capture its exit code
if /path/to/your/batch/job; then
  EXIT_CODE=0
else
  EXIT_CODE=$?
fi

END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))

# After the job completes, append the remaining metrics
echo "# HELP batch_job_duration_seconds Duration of the job" >> "${TMP_FILE}"
echo "# TYPE batch_job_duration_seconds gauge" >> "${TMP_FILE}"
echo "batch_job_duration_seconds{job_name=\"${JOB_NAME}\"} ${DURATION}" >> "${TMP_FILE}"
echo "# HELP batch_job_last_execution_timestamp_seconds Timestamp of the last execution" >> "${TMP_FILE}"
echo "# TYPE batch_job_last_execution_timestamp_seconds gauge" >> "${TMP_FILE}"
echo "batch_job_last_execution_timestamp_seconds{job_name=\"${JOB_NAME}\"} ${END_TIME}" >> "${TMP_FILE}"
echo "# HELP batch_job_exit_code Exit code of the last execution" >> "${TMP_FILE}"
echo "# TYPE batch_job_exit_code gauge" >> "${TMP_FILE}"
echo "batch_job_exit_code{job_name=\"${JOB_NAME}\"} ${EXIT_CODE}" >> "${TMP_FILE}"

# Atomically replace the metrics file so Node Exporter never reads a partial file
mv "${TMP_FILE}" "${METRICS_DIR}/${JOB_NAME}.prom"
Configure Node Exporter to read these metrics:
node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
Creating Effective Alerts for Batch Jobs
1. Alert on Job Failures
groups:
  - name: batch-jobs
    rules:
      - alert: BatchJobFailed
        expr: increase(batch_job_failure_total{job_name=~".*"}[1h]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Batch job {{ $labels.job_name }} failed"
          description: "The batch job {{ $labels.job_name }} has failed at least once in the last hour."
2. Alert on Missing Jobs
groups:
  - name: batch-jobs
    rules:
      - alert: BatchJobMissed
        expr: time() - batch_job_last_execution_timestamp{job_name="hourly_data_sync"} > 3600 * 1.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Batch job {{ $labels.job_name }} missed its schedule"
          description: "The batch job {{ $labels.job_name }} hasn't run within 1.5x its expected interval."
3. Alert on Unusually Long Execution Times
groups:
  - name: batch-jobs
    rules:
      - alert: BatchJobTooSlow
        expr: batch_job_duration_seconds{job_name="nightly_backup"} > 3600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Batch job {{ $labels.job_name }} is running too slowly"
          description: "The batch job {{ $labels.job_name }} took longer than expected to complete."
Real-World Example: Database Backup Monitoring
Let's implement a complete example for monitoring database backups:
#!/usr/bin/env python3
import time
import subprocess
import sys
from prometheus_client import CollectorRegistry, Counter, Gauge, push_to_gateway

# Define metrics in a dedicated registry
registry = CollectorRegistry()
success = Counter('db_backup_success_total', 'Successful DB backups', ['database'], registry=registry)
failure = Counter('db_backup_failure_total', 'Failed DB backups', ['database', 'error_type'], registry=registry)
duration = Gauge('db_backup_duration_seconds', 'Time taken for backup', ['database'], registry=registry)
size = Gauge('db_backup_size_bytes', 'Size of the backup file', ['database'], registry=registry)
last_run = Gauge('db_backup_last_run_timestamp', 'When the backup last ran', ['database'], registry=registry)

# Database to back up
db_name = 'production_db'

# Record start time and last run timestamp
start_time = time.time()
last_run.labels(database=db_name).set_to_current_time()

try:
    # Run the backup command
    subprocess.run(
        ['pg_dump', '-U', 'postgres', '-d', db_name, '-f', f'/backups/{db_name}.sql'],
        check=True,
        capture_output=True,
        text=True
    )

    # Get the backup file size
    file_size = subprocess.run(
        ['stat', '-c', '%s', f'/backups/{db_name}.sql'],
        check=True,
        capture_output=True,
        text=True
    )

    # Record metrics for success
    backup_size = int(file_size.stdout.strip())
    success.labels(database=db_name).inc()
    size.labels(database=db_name).set(backup_size)
    print(f"Backup completed successfully. Size: {backup_size} bytes")
except subprocess.CalledProcessError as e:
    # Determine the error type from stderr
    stderr = e.stderr or ''
    if 'connection' in stderr.lower():
        error_type = 'connection_error'
    elif 'permission' in stderr.lower():
        error_type = 'permission_error'
    else:
        error_type = 'unknown_error'

    # Record failure
    failure.labels(database=db_name, error_type=error_type).inc()
    print(f"Backup failed: {e}", file=sys.stderr)
    sys.exit(1)
finally:
    # Always record duration and push all metrics to the Pushgateway
    backup_duration = time.time() - start_time
    duration.labels(database=db_name).set(backup_duration)
    push_to_gateway('localhost:9091', job='database_backup', registry=registry)
Creating a Grafana Dashboard for Batch Jobs
Here's a sample Grafana dashboard query to visualize batch job performance:
# Success rate over time
sum(increase(batch_job_success_total{job_name="data_processor"}[1d])) /
(sum(increase(batch_job_success_total{job_name="data_processor"}[1d])) +
sum(increase(batch_job_failure_total{job_name="data_processor"}[1d])))
# Execution time trends
batch_job_duration_seconds{job_name="nightly_backup"}
# Last execution time per job
batch_job_last_execution_timestamp_seconds{job_name="hourly_data_sync"}
Advanced Techniques
1. Using the Textfile Collector with Synthetic Metrics
For extremely simple batch jobs, you can generate "event-based" metrics using timestamps:
#!/bin/bash
echo "batch_job_completed{job=\"log_rotation\",status=\"success\"} $(date +%s)" > /var/lib/node_exporter/textfile_collector/log_rotation.prom
Then in Prometheus, you can use time-based functions:
# Alert if no successful completions in the last day
time() - batch_job_completed{job="log_rotation",status="success"} > 86400
2. Multi-Instance Batch Job Monitoring
When running the same batch job across multiple servers, include the hostname both as a metric label and in the Pushgateway grouping key; without a distinct grouping key, each host's push would replace the metrics pushed by the others:
from prometheus_client import CollectorRegistry, Counter, push_to_gateway
import socket

# Include the hostname in labels
hostname = socket.gethostname()
job_name = 'daily_report_generation'

registry = CollectorRegistry()
success_counter = Counter(
    'batch_job_success_total',
    'Number of successful job runs',
    ['job_name', 'hostname'],
    registry=registry
)

# Increment with the hostname label
success_counter.labels(job_name=job_name, hostname=hostname).inc()

# Add the hostname to the grouping key so each host pushes to its own metric group
push_to_gateway(
    'central-pushgateway:9091',
    job='batch-jobs',
    grouping_key={'hostname': hostname},
    registry=registry
)
Summary
Monitoring batch jobs with Prometheus requires different approaches than monitoring long-running services. By using the Pushgateway, appropriate metrics, and well-crafted alerts, you can gain visibility into job health, performance, and reliability.
Key takeaways:
- Use Pushgateway for ephemeral batch job metrics
- Track success/failure counters, execution time, and last run time
- Create alerts for failures, missed schedules, and performance issues
- Consider using the textfile collector for simple use cases
- Design dashboards that visualize job performance over time
Exercises
- Set up a Pushgateway and monitor a simple batch job that counts files in a directory
- Create a PromQL query to calculate the success rate of a batch job over the last 7 days
- Implement monitoring for a cron job using the textfile collector approach
- Create a Grafana dashboard showing batch job performance metrics
- Set up alerts for batch jobs that fail more than twice in a row