Server Monitoring with Prometheus
Introduction
Server monitoring is a critical aspect of maintaining a healthy and reliable infrastructure. It involves tracking various system metrics such as CPU usage, memory consumption, disk I/O, and network traffic to ensure optimal performance and early detection of potential issues. Prometheus, as a powerful monitoring and alerting toolkit, excels at server monitoring tasks by providing robust data collection, storage, and querying capabilities.
In this guide, we'll explore how to set up Prometheus for effective server monitoring, understand key metrics to track, and learn how to create meaningful dashboards and alerts for your infrastructure.
Why Monitor Servers with Prometheus?
Before diving into the implementation details, let's understand why Prometheus is an excellent choice for server monitoring:
- Pull-based Architecture: Prometheus pulls metrics from targets, simplifying firewall configurations and providing better control over monitoring.
- Time Series Database: Built-in storage optimized for time-series data with efficient compression.
- Powerful Query Language (PromQL): Enables complex data analysis and aggregation.
- Service Discovery: Automatically discovers and monitors new servers in dynamic environments.
- Alerting Capabilities: Integrated alerting with AlertManager for proactive issue detection.
- Extensive Ecosystem: Rich collection of exporters and integrations for various systems.
Setting Up Node Exporter
The first step in server monitoring with Prometheus is to deploy Node Exporter on each server you want to monitor. Node Exporter is a Prometheus exporter that collects a wide range of hardware and OS metrics.
Installing Node Exporter
On Linux systems, you can install Node Exporter as follows:
# Download Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
# Extract the archive
tar xvfz node_exporter-1.5.0.linux-amd64.tar.gz
# Move to /usr/local/bin
sudo mv node_exporter-1.5.0.linux-amd64/node_exporter /usr/local/bin/
# Create a dedicated system user for Node Exporter
sudo useradd -rs /bin/false node_exporter
# Create a systemd service file
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << EOF
[Unit]
Description=Node Exporter
After=network.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Once installed, Node Exporter will expose metrics on port 9100 (by default) at the /metrics endpoint.
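You can quickly verify the endpoint with curl; the values below are illustrative, but the format is what Node Exporter actually emits:

# Check that metrics are being served
curl -s http://localhost:9100/metrics | grep '^node_cpu'

# Example output (values will vary):
# node_cpu_seconds_total{cpu="0",mode="idle"} 141237.92
# node_cpu_seconds_total{cpu="0",mode="system"} 1032.47
# node_cpu_seconds_total{cpu="0",mode="user"} 4311.06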
Configuring Prometheus to Scrape Node Exporter
Add the following job to your prometheus.yml configuration file:
scrape_configs:
  - job_name: 'node'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9100', 'server2:9100', 'server3:9100']
For dynamic environments, you might want to use service discovery instead:
scrape_configs:
  - job_name: 'node'
    scrape_interval: 15s
    consul_sd_configs:
      - server: 'localhost:8500'
        services: ['node-exporter']
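If you don't run Consul, file-based service discovery is a lightweight alternative: Prometheus watches one or more target files and picks up changes without a restart. A minimal sketch (the file path and its contents are illustrative):

scrape_configs:
  - job_name: 'node'
    scrape_interval: 15s
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/node-*.json'
        refresh_interval: 1m

A matching targets file could look like:

[
  {
    "targets": ["server1:9100", "server2:9100"],
    "labels": { "environment": "production" }
  }
]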
Key Server Metrics to Monitor
Node Exporter provides hundreds of metrics. Here are some of the most important categories to focus on:
CPU Metrics
# CPU time fraction per CPU and mode (user, system, iowait, etc.)
rate(node_cpu_seconds_total{mode!="idle"}[1m])
# Overall CPU usage (all modes except idle)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
# CPU load averages
node_load1, node_load5, node_load15
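Load averages only make sense relative to the number of CPU cores, so it's common to normalize them. Counting the idle-mode series per instance is one way to get the core count (any single mode works, since there is one series per CPU and mode):

# 5-minute load average per core; values near or above 1 indicate saturation
node_load5 / count by (instance) (node_cpu_seconds_total{mode="idle"})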
Memory Metrics
# Available memory
node_memory_MemAvailable_bytes
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Swap usage percentage (node_exporter exposes SwapTotal and SwapFree, not SwapUsed)
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100
Disk Metrics
# Disk space usage percentage
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100
# Disk I/O operations
rate(node_disk_reads_completed_total[1m])
rate(node_disk_writes_completed_total[1m])
# Disk I/O bytes
rate(node_disk_read_bytes_total[1m])
rate(node_disk_written_bytes_total[1m])
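Usage percentages tell you where a disk is now, not when it will fill up. predict_linear() extrapolates the recent trend, which is handy for forecasting; for example, to find filesystems projected to run out of space within 24 hours based on the last 6 hours (the window and fstype filter are illustrative):

# Filesystems predicted to be full within 24 hours
predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24 * 3600) < 0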
Network Metrics
# Network traffic
rate(node_network_receive_bytes_total[1m])
rate(node_network_transmit_bytes_total[1m])
# Network errors
rate(node_network_receive_errs_total[1m])
rate(node_network_transmit_errs_total[1m])
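On hosts running containers, loopback and virtual interfaces can drown out the physical NICs, so it often helps to filter on the device label (the regex is just an example; adjust it to your interface naming):

# Traffic on physical interfaces only
rate(node_network_receive_bytes_total{device!~"lo|veth.*|docker.*|br.*"}[1m])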
Creating a Server Monitoring Dashboard
A well-designed dashboard helps visualize server performance at a glance. You can use Grafana to create dashboards for your Prometheus metrics.
Here's a basic dashboard structure:
- System Overview Panel:
  - CPU, Memory, Disk, and Network usage at a glance
  - System uptime
  - Number of processes
- CPU Panel:
  - CPU usage over time
  - CPU usage by mode (user, system, iowait)
  - Load average
- Memory Panel:
  - Memory usage over time
  - Swap usage
  - Memory by type (used, cached, buffers)
- Disk Panel:
  - Disk space usage by mount point
  - Disk I/O operations
  - Disk latency
- Network Panel:
  - Network traffic in/out
  - Network packets in/out
  - Network errors
Here's a simplified Grafana dashboard JSON example with CPU and memory usage panels:
{
"panels": [
{
"title": "CPU Usage",
"type": "graph",
"targets": [
{
"expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[1m])) * 100)"
}
]
},
{
"title": "Memory Usage",
"type": "graph",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
}
]
}
]
}
Setting Up Alerts
Prometheus allows you to set up alerts to proactively notify you of potential issues. Here are some common server monitoring alerts:
Alert Rules Configuration
Create a file named server_alerts.yml:
groups:
  - name: server_alerts
    rules:
      - alert: HighCPULoad
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load (instance {{ $labels.instance }})"
          description: "CPU load is above 80% for 5 minutes\nVALUE = {{ $value }}\nLABELS = {{ $labels }}"
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage (instance {{ $labels.instance }})"
          description: "Memory usage is above 90% for 5 minutes\nVALUE = {{ $value }}\nLABELS = {{ $labels }}"
      - alert: DiskSpaceRunningOut
        expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space running out (instance {{ $labels.instance }})"
          description: "Disk usage is above 85%\nVALUE = {{ $value }}\nLABELS = {{ $labels }}"
Then, include this file in your prometheus.yml:
rule_files:
  - "server_alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
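Before reloading Prometheus, it's worth validating both files with promtool, which ships with Prometheus:

# Validate the main configuration and the alert rules
promtool check config prometheus.yml
promtool check rules server_alerts.yml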
Advanced Server Monitoring with Prometheus
Beyond basic metrics, you can implement more advanced monitoring techniques:
Process Monitoring
Monitor specific processes using the process-exporter:
# Install process-exporter
wget https://github.com/ncabatoff/process-exporter/releases/download/v0.7.10/process-exporter-0.7.10.linux-amd64.tar.gz
tar xvfz process-exporter-0.7.10.linux-amd64.tar.gz
sudo mv process-exporter-0.7.10.linux-amd64/process-exporter /usr/local/bin/
Create a configuration file:
process_names:
  - name: "{{.Matches}}"
    cmdline:
      - 'nginx'
  - name: "{{.Matches}}"
    cmdline:
      - 'postgres'
Then configure Prometheus to scrape it.
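A minimal scrape job might look like this (process-exporter listens on port 9256 by default; see the project README for the exact flags to point it at your configuration file):

scrape_configs:
  - job_name: 'process'
    static_configs:
      - targets: ['localhost:9256']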
Hardware Monitoring
For hardware-level metrics, you can use various specialized exporters:
- IPMI Exporter: For server hardware metrics
- SNMP Exporter: For network devices
- Nvidia DCGM Exporter: For GPU metrics
Log Monitoring
Prometheus itself doesn't collect logs, but it pairs well with Grafana Loki: an agent such as Promtail ships logs to Loki, while Prometheus can scrape Loki's own operational metrics:

scrape_configs:
  - job_name: loki
    static_configs:
      - targets: ['loki:3100']
Monitoring Architecture Visualization
At a high level, Node Exporter runs on each server and exposes metrics, Prometheus scrapes and stores them and evaluates alert rules, Grafana queries Prometheus for dashboards, and firing alerts are handed to AlertManager for routing:

Node Exporter (per server) ---> Prometheus (scrape, store, evaluate rules)
                                     |                     |
                                     v                     v
                             Grafana (dashboards)   AlertManager ---> Slack / PagerDuty
Practical Example: Complete Monitoring Setup
Let's walk through a complete example of setting up server monitoring for a web application:
- Prerequisites:
  - Multiple servers running Linux
  - Docker installed (optional, for containerized setup)
- Step 1: Install Node Exporter on all servers (as shown earlier)
- Step 2: Set up Prometheus with the following configuration:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "server_alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['server1:9100', 'server2:9100', 'server3:9100']
        labels:
          environment: production
      - targets: ['dev-server:9100']
        labels:
          environment: development
- Step 3: Set up AlertManager:
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR_SLACK_WEBHOOK'

route:
  group_by: ['alertname', 'instance', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#monitoring'
        send_resolved: true
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
  - name: 'pagerduty-notifications'
    pagerduty_configs:
      - service_key: YOUR_PAGERDUTY_KEY
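As with Prometheus rule files, you can validate this configuration before deploying it with amtool, which ships with Alertmanager:

# Validate the Alertmanager configuration
amtool check-config alertmanager.yml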
- Step 4: Set up Grafana and create a dashboard
- Step 5: Create custom recording rules for frequently used queries:
groups:
  - name: recording_rules
    rules:
      - record: instance:node_cpu_usage:rate5m
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: instance:node_memory_usage:percent
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
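Once these rules are evaluating, alerts and dashboard queries can reference the precomputed series instead of re-running the full expression each time, e.g.:

# Same condition as the HighCPULoad alert, now cheap to evaluate
instance:node_cpu_usage:rate5m > 80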
Troubleshooting Common Issues
When setting up server monitoring with Prometheus, you might encounter these common issues:
- Node Exporter Not Accessible:
  - Check if the Node Exporter service is running: systemctl status node_exporter
  - Verify firewall rules allow access to port 9100
  - Test the endpoint directly: curl http://localhost:9100/metrics
- Missing Metrics:
  - Ensure the server's time is synchronized using NTP
  - Check if the Prometheus scrape job is correctly configured
  - Look for errors in Prometheus logs: journalctl -u prometheus
- High Cardinality Issues:
  - Be cautious with metrics that have many unique label combinations
  - Use recording rules to pre-aggregate data
  - Consider using topk or bottomk in your queries (see the example below)
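For example, to chart only the five busiest servers rather than every instance (this reuses the recording rule from Step 5):

# Five instances with the highest CPU usage
topk(5, instance:node_cpu_usage:rate5m)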
Summary
Server monitoring with Prometheus provides a robust foundation for maintaining reliable infrastructure. By collecting and analyzing key metrics, you can:
- Detect Issues Early: Identify problems before they affect users
- Optimize Performance: Find bottlenecks and improve efficiency
- Plan Capacity: Make informed decisions about scaling resources
- Enhance Security: Detect unusual patterns that might indicate security issues
This guide covered the basics of setting up Node Exporter, configuring Prometheus, creating dashboards, and setting up alerts. As you become more comfortable with these concepts, you can extend your monitoring setup to include more specialized metrics and advanced analysis techniques.
Additional Resources
To continue learning about server monitoring with Prometheus, check out these resources:
- Official Prometheus Documentation
- Node Exporter Documentation
- PromQL Cheat Sheet
- Grafana Dashboard Examples
Exercises
- Basic Setup: Install Node Exporter on a test server and configure Prometheus to scrape it.
- Dashboard Creation: Create a Grafana dashboard showing the four golden signals (latency, traffic, errors, and saturation).
- Alert Configuration: Set up alerts for high CPU, memory, disk usage, and unreachable targets.
- Custom Metrics: Use the textfile collector to expose custom metrics from your application.
- Recording Rules: Create recording rules for frequently used queries to improve performance.