
Batch Job Monitoring

Introduction

Batch jobs are programs that run at scheduled intervals without user interaction. Examples include data processing tasks, backup scripts, log rotations, and scheduled cleanups. Unlike long-running services, batch jobs execute for a limited time and then terminate, creating unique monitoring challenges.

In this guide, we'll explore how to effectively monitor batch jobs using Prometheus, addressing common challenges like:

  • Detecting job failures
  • Measuring execution time
  • Tracking success/failure rates
  • Alerting when jobs don't run on schedule
  • Monitoring resource consumption

Why Batch Job Monitoring Is Challenging

Traditional monitoring approaches often focus on long-running services where metrics are continuously available. Batch jobs present several challenges:

  1. Intermittent execution: Jobs only emit metrics while running
  2. Ephemeral nature: Once a job completes, its process (and metrics) disappear
  3. Schedule adherence: Need to detect if a job didn't run at all
  4. Binary success/failure: Often need to track outcomes rather than states

Prometheus Approach to Batch Job Monitoring

Prometheus offers several patterns for effectively monitoring batch jobs:

1. The Pushgateway Pattern

The Prometheus Pushgateway is a component that allows ephemeral jobs to push their metrics to an intermediate service where Prometheus can scrape them later.
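
To make the pattern concrete, here is a minimal sketch of the push side (the metric and job names are illustrative, not prescribed), assuming a Pushgateway reachable at localhost:9091 as set up in the next subsection:

python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Dedicated registry so only this job's metrics are pushed
registry = CollectorRegistry()
completed = Gauge('batch_job_last_success_timestamp',
                  'When the job last finished successfully',
                  registry=registry)

# ... run the job ...

completed.set_to_current_time()
# The Pushgateway keeps these metrics until the next push (or deletion),
# so Prometheus can scrape them long after this process has exited.
push_to_gateway('localhost:9091', job='example_batch_job', registry=registry)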

Setting Up the Pushgateway

First, install and run the Pushgateway:

bash
# Using Docker
docker run -d -p 9091:9091 prom/pushgateway

Configure Prometheus to scrape metrics from the Pushgateway:

yaml
scrape_configs:
  - job_name: 'pushgateway'
    scrape_interval: 10s
    honor_labels: true  # Important: preserves job labels from pushed metrics
    static_configs:
      - targets: ['localhost:9091']

2. Key Metrics for Batch Jobs

Success/Failure Counters

For tracking job outcomes:

python
from prometheus_client import CollectorRegistry, Counter, push_to_gateway

# Use a dedicated registry so only these metrics are pushed
registry = CollectorRegistry()

# Create counters
success_counter = Counter('batch_job_success_total', 'Number of successful job runs',
                          ['job_name'], registry=registry)
failure_counter = Counter('batch_job_failure_total', 'Number of failed job runs',
                          ['job_name'], registry=registry)

# Example batch job
job_name = 'data_processor'
try:
    # Job logic here
    # ...

    # If successful
    success_counter.labels(job_name=job_name).inc()
    push_to_gateway('localhost:9091', job='batch-jobs', registry=registry)
except Exception:
    # If failed
    failure_counter.labels(job_name=job_name).inc()
    push_to_gateway('localhost:9091', job='batch-jobs', registry=registry)
    raise

Execution Time

Measure how long your jobs take to run:

python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
import time

registry = CollectorRegistry()

# Create a gauge for execution time
execution_time = Gauge('batch_job_duration_seconds', 'Duration of batch job execution',
                       ['job_name'], registry=registry)

# Example batch job with timing
job_name = 'nightly_backup'
start_time = time.time()

try:
    # Job logic here
    # ...

    # Record execution time
    duration = time.time() - start_time
    execution_time.labels(job_name=job_name).set(duration)
    push_to_gateway('localhost:9091', job='batch-jobs', registry=registry)
except Exception:
    # Still record execution time on failure
    duration = time.time() - start_time
    execution_time.labels(job_name=job_name).set(duration)
    push_to_gateway('localhost:9091', job='batch-jobs', registry=registry)
    raise
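
As an aside, prometheus_client gauges also provide a time() helper that sets the gauge to the elapsed seconds when the block exits, which can replace the manual time.time() arithmetic. A small sketch using the metric defined above, where run_job() is a hypothetical placeholder for your job logic:

python
# Equivalent timing with the gauge's time() context manager:
# the gauge is set to the elapsed seconds when the block exits,
# even if the job raises; the push still needs a try/finally.
try:
    with execution_time.labels(job_name=job_name).time():
        run_job()  # hypothetical placeholder for the actual job logic
finally:
    push_to_gateway('localhost:9091', job='batch-jobs', registry=registry)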

Last Execution Timestamp

Track when the job last ran:

python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()

# Create a gauge for last execution timestamp
last_execution = Gauge('batch_job_last_execution_timestamp', 'Timestamp of last execution',
                       ['job_name'], registry=registry)

job_name = 'hourly_data_sync'
# Set to current UNIX timestamp
last_execution.labels(job_name=job_name).set_to_current_time()
push_to_gateway('localhost:9091', job='batch-jobs', registry=registry)
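
In practice these three signals (outcome, duration, last run) are usually recorded together for every job. One way to bundle them is a small context manager; the sketch below is illustrative rather than a standard recipe, and assumes the same Pushgateway address used above:

python
import time
from contextlib import contextmanager
from prometheus_client import CollectorRegistry, Counter, Gauge, push_to_gateway

@contextmanager
def monitored_batch_job(job_name, gateway='localhost:9091'):
    """Record success/failure, duration and last-run time for one job run."""
    registry = CollectorRegistry()
    success = Counter('batch_job_success_total', 'Successful runs', ['job_name'], registry=registry)
    failure = Counter('batch_job_failure_total', 'Failed runs', ['job_name'], registry=registry)
    duration = Gauge('batch_job_duration_seconds', 'Run duration', ['job_name'], registry=registry)
    last_run = Gauge('batch_job_last_execution_timestamp', 'Last run time', ['job_name'], registry=registry)

    start = time.time()
    last_run.labels(job_name=job_name).set_to_current_time()
    try:
        yield
        success.labels(job_name=job_name).inc()
    except Exception:
        failure.labels(job_name=job_name).inc()
        raise
    finally:
        duration.labels(job_name=job_name).set(time.time() - start)
        push_to_gateway(gateway, job='batch-jobs', registry=registry)

# Usage:
with monitored_batch_job('data_processor'):
    pass  # job logic here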

3. Monitoring Batch Jobs with Cron Exporter

For cron-based jobs, you can use a dedicated exporter like prometheus-cron-exporter or the Node Exporter's textfile collector.

Example using textfile collector:

  1. Create a script that runs alongside your job:
bash
#!/bin/bash

JOB_NAME="database_backup"
METRICS_DIR="/var/lib/node_exporter/textfile_collector"
TMP_FILE="${METRICS_DIR}/${JOB_NAME}.prom.tmp"
PROM_FILE="${METRICS_DIR}/${JOB_NAME}.prom"

# Record the start time before the job runs
START_TIME=$(date +%s)
echo "# HELP batch_job_start_time_seconds Timestamp when the job started" > "$TMP_FILE"
echo "# TYPE batch_job_start_time_seconds gauge" >> "$TMP_FILE"
echo "batch_job_start_time_seconds{job_name=\"${JOB_NAME}\"} ${START_TIME}" >> "$TMP_FILE"

# Run the actual job
if /path/to/your/batch/job; then
  EXIT_CODE=0
else
  EXIT_CODE=$?   # exit status of the job command
fi
END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))

# After the job completes, append the remaining metrics
echo "# HELP batch_job_duration_seconds Duration of the job" >> "$TMP_FILE"
echo "# TYPE batch_job_duration_seconds gauge" >> "$TMP_FILE"
echo "batch_job_duration_seconds{job_name=\"${JOB_NAME}\"} ${DURATION}" >> "$TMP_FILE"

echo "# HELP batch_job_last_execution_timestamp_seconds Timestamp of the last execution" >> "$TMP_FILE"
echo "# TYPE batch_job_last_execution_timestamp_seconds gauge" >> "$TMP_FILE"
echo "batch_job_last_execution_timestamp_seconds{job_name=\"${JOB_NAME}\"} ${END_TIME}" >> "$TMP_FILE"

echo "# HELP batch_job_exit_code Exit code of the last execution" >> "$TMP_FILE"
echo "# TYPE batch_job_exit_code gauge" >> "$TMP_FILE"
echo "batch_job_exit_code{job_name=\"${JOB_NAME}\"} ${EXIT_CODE}" >> "$TMP_FILE"

# Atomically replace the exposed file so the collector never reads a partial write
mv "$TMP_FILE" "$PROM_FILE"

  2. Configure Node Exporter to read these metrics:

bash
node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector
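
If the batch job itself is written in Python, the shell wrapper isn't strictly necessary: prometheus_client provides write_to_textfile, which renders a registry in the textfile format and replaces the target file via a temporary file and rename. A brief sketch, assuming the same textfile directory configured above:

python
from prometheus_client import CollectorRegistry, Gauge, write_to_textfile

registry = CollectorRegistry()
last_run = Gauge('batch_job_last_execution_timestamp_seconds',
                 'Timestamp of the last execution', ['job_name'], registry=registry)

# ... run the job ...

last_run.labels(job_name='database_backup').set_to_current_time()
# Writes to a temp file and renames it into place for the textfile collector
write_to_textfile('/var/lib/node_exporter/textfile_collector/database_backup.prom', registry)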

Creating Effective Alerts for Batch Jobs

1. Alert on Job Failures

yaml
groups:
  - name: batch-jobs
    rules:
      - alert: BatchJobFailed
        expr: increase(batch_job_failure_total{job_name=~".*"}[1h]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Batch job {{ $labels.job_name }} failed"
          description: "The batch job {{ $labels.job_name }} has failed at least once in the last hour."

2. Alert on Missing Jobs

yaml
groups:
  - name: batch-jobs
    rules:
      - alert: BatchJobMissed
        expr: time() - batch_job_last_execution_timestamp{job_name="hourly_data_sync"} > 3600 * 1.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Batch job {{ $labels.job_name }} missed its schedule"
          description: "The batch job {{ $labels.job_name }} hasn't run in the last 1.5x its expected interval."

3. Alert on Unusually Long Execution Times

yaml
groups:
  - name: batch-jobs
    rules:
      - alert: BatchJobTooSlow
        expr: batch_job_duration_seconds{job_name="nightly_backup"} > 3600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Batch job {{ $labels.job_name }} running too slow"
          description: "The batch job {{ $labels.job_name }} took longer than expected to complete."

Real-World Example: Database Backup Monitoring

Let's implement a complete example for monitoring database backups:

python
#!/usr/bin/env python3
import time
import subprocess
import sys
from prometheus_client import CollectorRegistry, Counter, Gauge, push_to_gateway

# Define metrics in a dedicated registry for this job
registry = CollectorRegistry()
success = Counter('db_backup_success_total', 'Successful DB backups', ['database'], registry=registry)
failure = Counter('db_backup_failure_total', 'Failed DB backups', ['database', 'error_type'], registry=registry)
duration = Gauge('db_backup_duration_seconds', 'Time taken for backup', ['database'], registry=registry)
size = Gauge('db_backup_size_bytes', 'Size of the backup file', ['database'], registry=registry)
last_run = Gauge('db_backup_last_run_timestamp', 'When the backup last ran', ['database'], registry=registry)

# Database to back up
db_name = 'production_db'

# Record start time and last run timestamp
start_time = time.time()
last_run.labels(database=db_name).set_to_current_time()

try:
    # Run the backup command
    subprocess.run(
        ['pg_dump', '-U', 'postgres', '-d', db_name, '-f', f'/backups/{db_name}.sql'],
        check=True,
        capture_output=True,
        text=True
    )

    # Get backup file size
    file_size = subprocess.run(
        ['stat', '-c', '%s', f'/backups/{db_name}.sql'],
        check=True,
        capture_output=True,
        text=True
    )

    # Record metrics for success
    backup_size = int(file_size.stdout.strip())
    success.labels(database=db_name).inc()
    size.labels(database=db_name).set(backup_size)

    print(f"Backup completed successfully. Size: {backup_size} bytes")

except subprocess.CalledProcessError as e:
    # Determine error type from the captured stderr
    stderr = (e.stderr or '').lower()
    if 'connection' in stderr:
        error_type = 'connection_error'
    elif 'permission' in stderr:
        error_type = 'permission_error'
    else:
        error_type = 'unknown_error'

    # Record failure
    failure.labels(database=db_name, error_type=error_type).inc()
    print(f"Backup failed: {e}", file=sys.stderr)
    sys.exit(1)

finally:
    # Always record duration and push all metrics to the Pushgateway
    backup_duration = time.time() - start_time
    duration.labels(database=db_name).set(backup_duration)
    push_to_gateway('localhost:9091', job='database_backup', registry=registry)

Creating a Grafana Dashboard for Batch Jobs

Here are some sample Grafana dashboard queries to visualize batch job performance:

# Success rate over time
sum(increase(batch_job_success_total{job_name="data_processor"}[1d])) /
(sum(increase(batch_job_success_total{job_name="data_processor"}[1d])) +
sum(increase(batch_job_failure_total{job_name="data_processor"}[1d])))

# Execution time trends
batch_job_duration_seconds{job_name="nightly_backup"}

# Last execution time per job (from the textfile collector example)
batch_job_last_execution_timestamp_seconds{job_name="hourly_data_sync"}

Advanced Techniques

1. Using the Textfile Collector with Synthetic Metrics

For extremely simple batch jobs, you can generate "event-based" metrics using timestamps:

bash
#!/bin/bash
echo "batch_job_completed{job=\"log_rotation\",status=\"success\"} $(date +%s)" > /var/lib/node_exporter/textfile_collector/log_rotation.prom

Then in Prometheus, you can use time-based functions:

# Alert if no successful completions in the last day
time() - batch_job_completed{job="log_rotation",status="success"} > 86400

2. Multi-Instance Batch Job Monitoring

When running the same batch job across multiple servers:

python
from prometheus_client import CollectorRegistry, Counter, push_to_gateway
import socket

# Include hostname in labels
hostname = socket.gethostname()
job_name = 'daily_report_generation'

registry = CollectorRegistry()
success_counter = Counter(
    'batch_job_success_total',
    'Number of successful job runs',
    ['job_name', 'hostname'],
    registry=registry
)

# Increment with hostname
success_counter.labels(job_name=job_name, hostname=hostname).inc()

# Push under a per-host grouping key so hosts don't overwrite each other's metrics
push_to_gateway('central-pushgateway:9091', job='batch-jobs',
                grouping_key={'hostname': hostname}, registry=registry)
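
A note on the design: each push to the Pushgateway replaces all metrics previously pushed for that group, and by default the group is identified by the job name alone. The grouping_key in the sketch above is therefore an assumption worth calling out: without it, servers pushing under the same job would overwrite each other's metrics even though the samples carry different hostname labels. Choose a grouping key that matches however you distinguish instances of the job.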

Summary

Monitoring batch jobs with Prometheus requires different approaches than monitoring long-running services. By using the Pushgateway, appropriate metrics, and well-crafted alerts, you can gain visibility into job health, performance, and reliability.

Key takeaways:

  • Use Pushgateway for ephemeral batch job metrics
  • Track success/failure counters, execution time, and last run time
  • Create alerts for failures, missed schedules, and performance issues
  • Consider using the textfile collector for simple use cases
  • Design dashboards that visualize job performance over time

Exercises

  1. Set up a Pushgateway and monitor a simple batch job that counts files in a directory
  2. Create a PromQL query to calculate the success rate of a batch job over the last 7 days
  3. Implement monitoring for a cron job using the textfile collector approach
  4. Create a Grafana dashboard showing batch job performance metrics
  5. Set up alerts for batch jobs that fail more than twice in a row
