Ansible Infrastructure Monitoring
Introduction
Infrastructure monitoring is a critical aspect of maintaining healthy, performant systems in modern IT environments. As infrastructure scales, manual monitoring becomes impractical and error-prone. This is where Ansible shines - it allows you to automate the deployment, configuration, and maintenance of monitoring solutions across your entire infrastructure.
In this guide, we'll explore how to use Ansible to implement monitoring solutions that help you maintain visibility into your systems, detect issues proactively, and respond to problems efficiently.
Understanding Infrastructure Monitoring with Ansible
Infrastructure monitoring involves collecting metrics, logs, and status information from various components of your IT environment. Ansible can automate this process by:
- Installing and configuring monitoring agents and collectors
- Deploying central monitoring servers
- Setting up dashboards and visualization tools
- Configuring alerting mechanisms
- Creating regular maintenance and update workflows
Let's dive into how we can implement these capabilities using Ansible's automation framework.
Setting Up Your Environment
Before we begin, ensure you have:
- Ansible installed (version 2.9+)
- SSH access to your target servers
- Appropriate privileges to install software
- A basic inventory file
Here's a simple inventory structure for our examples:
[monitoring_servers]
monitor01.example.com
[web_servers]
web01.example.com
web02.example.com
[database_servers]
db01.example.com
[all:vars]
ansible_user=ansible
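With the inventory in place, a quick connectivity check before running any playbooks can save debugging time later. This one-liner uses Ansible's built-in ping module against the inventory above; every host should report SUCCESS, and if one does not, fix SSH access or the ansible_user setting before continuing:
ansible -i inventory all -m ansible.builtin.ping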
Deploying Prometheus with Ansible
Prometheus is a popular open-source monitoring system that works well with Ansible automation. Let's create a playbook to deploy Prometheus.
Step 1: Create the Prometheus Role Structure
First, let's set up a role for Prometheus:
ansible-galaxy init roles/prometheus
This creates a standard role directory structure (only the directories we will use are shown; prometheus.yml.j2 is a template we will add ourselves):
roles/prometheus/
├── defaults
│   └── main.yml
├── files
├── handlers
│   └── main.yml
├── meta
│   └── main.yml
├── tasks
│   └── main.yml
├── templates
│   └── prometheus.yml.j2
└── vars
    └── main.yml
Step 2: Define Prometheus Configuration Variables
In roles/prometheus/defaults/main.yml:
---
prometheus_version: 2.36.0
prometheus_web_listen_address: "0.0.0.0:9090"
prometheus_storage_retention: "15d"
prometheus_scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9100']
Step 3: Create Prometheus Configuration Template
In roles/prometheus/templates/prometheus.yml.j2:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
{% for scrape_config in prometheus_scrape_configs %}
  - job_name: {{ scrape_config.job_name | quote }}
    scrape_interval: {{ scrape_config.scrape_interval }}
    static_configs:
{% for static_config in scrape_config.static_configs %}
      - targets: {{ static_config.targets }}
{% endfor %}
{% endfor %}
Step 4: Define Installation Tasks
In roles/prometheus/tasks/main.yml:
---
- name: Create Prometheus system group
  group:
    name: prometheus
    state: present
    system: true

- name: Create Prometheus system user
  user:
    name: prometheus
    group: prometheus
    system: true
    shell: /sbin/nologin
    create_home: false

- name: Create Prometheus directories
  file:
    path: "{{ item }}"
    state: directory
    owner: prometheus
    group: prometheus
    mode: 0755
  with_items:
    - /etc/prometheus
    - /var/lib/prometheus

- name: Download and extract Prometheus binary
  unarchive:
    src: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
    dest: /tmp
    remote_src: yes
    creates: "/tmp/prometheus-{{ prometheus_version }}.linux-amd64"

- name: Copy Prometheus binaries
  copy:
    src: "/tmp/prometheus-{{ prometheus_version }}.linux-amd64/{{ item }}"
    dest: "/usr/local/bin/{{ item }}"
    remote_src: yes
    owner: prometheus
    group: prometheus
    mode: 0755
  with_items:
    - prometheus
    - promtool

- name: Create Prometheus configuration
  template:
    src: prometheus.yml.j2
    dest: /etc/prometheus/prometheus.yml
    owner: prometheus
    group: prometheus
    mode: 0644
  notify: restart prometheus

- name: Create Prometheus systemd service
  template:
    src: prometheus.service.j2
    dest: /etc/systemd/system/prometheus.service
    owner: root
    group: root
    mode: 0644
  notify: restart prometheus

- name: Enable and start Prometheus service
  systemd:
    name: prometheus
    state: started
    enabled: yes
    daemon_reload: yes
Step 5: Create the Service Template
In roles/prometheus/templates/prometheus.service.j2:
[Unit]
Description=Prometheus Time Series Collection and Processing Server
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time={{ prometheus_storage_retention }} \
  --web.listen-address={{ prometheus_web_listen_address }} \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries
Restart=always

[Install]
WantedBy=multi-user.target
Step 6: Define Handlers
In roles/prometheus/handlers/main.yml:
---
- name: restart prometheus
  systemd:
    name: prometheus
    state: restarted
Step 7: Create the Main Playbook
Now let's create the main playbook to use our role:
---
- name: Deploy Prometheus Monitoring Server
  hosts: monitoring_servers
  become: true
  roles:
    - prometheus
This playbook can be executed with:
ansible-playbook -i inventory prometheus_setup.yml
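After the play completes, you can confirm Prometheus is answering without leaving Ansible by hitting its readiness endpoint ad hoc (the /-/ready endpoint is standard on recent Prometheus releases and assumes the default port used above):
ansible monitoring_servers -i inventory -m ansible.builtin.uri -a "url=http://localhost:9090/-/ready"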
Deploying Node Exporter for System Metrics
To monitor your infrastructure, you need to collect metrics from individual nodes. Prometheus's Node Exporter is perfect for this:
Step 1: Create Node Exporter Role
ansible-galaxy init roles/node_exporter
Step 2: Define Node Exporter Tasks
In roles/node_exporter/tasks/main.yml:
---
- name: Create Node Exporter system group
  group:
    name: node_exporter
    state: present
    system: true

- name: Create Node Exporter system user
  user:
    name: node_exporter
    group: node_exporter
    system: true
    shell: /sbin/nologin
    create_home: false

- name: Download and extract Node Exporter
  unarchive:
    src: "https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz"
    dest: /tmp
    remote_src: yes
    creates: "/tmp/node_exporter-1.3.1.linux-amd64"

- name: Copy Node Exporter binary
  copy:
    src: "/tmp/node_exporter-1.3.1.linux-amd64/node_exporter"
    dest: "/usr/local/bin/node_exporter"
    remote_src: yes
    owner: node_exporter
    group: node_exporter
    mode: 0755

- name: Create Node Exporter systemd service
  template:
    src: node_exporter.service.j2
    dest: /etc/systemd/system/node_exporter.service
    owner: root
    group: root
    mode: 0644
  notify: restart node_exporter

- name: Enable and start Node Exporter service
  systemd:
    name: node_exporter
    state: started
    enabled: yes
    daemon_reload: yes
Step 3: Create Service Template
In roles/node_exporter/templates/node_exporter.service.j2:
[Unit]
Description=Node Exporter
Documentation=https://github.com/prometheus/node_exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target
Step 4: Define Handlers
In roles/node_exporter/handlers/main.yml:
---
- name: restart node_exporter
  systemd:
    name: node_exporter
    state: restarted
Step 5: Create Node Exporter Playbook
---
- name: Deploy Node Exporter to All Servers
  hosts: all:!monitoring_servers
  become: true
  roles:
    - node_exporter
Run the playbook:
ansible-playbook -i inventory node_exporter_setup.yml
Step 6: Update Prometheus Configuration
Now we need to update Prometheus to scrape metrics from our nodes. Let's modify our Prometheus configuration:
- name: Configure Prometheus to monitor all nodes
  hosts: monitoring_servers
  become: true
  vars:
    prometheus_scrape_configs:
      - job_name: 'prometheus'
        scrape_interval: 5s
        static_configs:
          - targets: ['localhost:9090']
      - job_name: 'node'
        scrape_interval: 15s
        static_configs:
          - targets:
              - 'web01.example.com:9100'
              - 'web02.example.com:9100'
              - 'db01.example.com:9100'
  roles:
    - prometheus
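To verify that Prometheus actually registered the new targets, you can query its targets API. Below is a minimal sketch using the uri module; the response fields referenced (data.activeTargets, scrapeUrl) follow the documented Prometheus HTTP API, and the default port from this guide is assumed:

- name: Check that Prometheus discovered the node targets
  hosts: monitoring_servers
  tasks:
    - name: Query the Prometheus targets API
      ansible.builtin.uri:
        url: http://localhost:9090/api/v1/targets
        return_content: yes
      register: prom_targets

    - name: List the scrape URLs Prometheus is polling
      ansible.builtin.debug:
        msg: "{{ prom_targets.json.data.activeTargets | map(attribute='scrapeUrl') | list }}"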
Setting Up Grafana for Visualization
Prometheus collects data, but Grafana provides beautiful visualizations. Let's set it up:
Step 1: Create Grafana Role
ansible-galaxy init roles/grafana
Step 2: Define Grafana Tasks
In roles/grafana/tasks/main.yml:
---
- name: Add Grafana GPG key
  apt_key:
    url: https://packages.grafana.com/gpg.key
    state: present
  when: ansible_os_family == "Debian"

- name: Add Grafana repository (Debian/Ubuntu)
  apt_repository:
    repo: "deb https://packages.grafana.com/oss/deb stable main"
    state: present
  when: ansible_os_family == "Debian"

- name: Install Grafana package (Debian/Ubuntu)
  apt:
    name: grafana
    state: present
    update_cache: yes
  when: ansible_os_family == "Debian"
  notify: restart grafana

- name: Add Grafana repository (RedHat/CentOS)
  yum_repository:
    name: grafana
    description: Grafana repository
    baseurl: https://packages.grafana.com/oss/rpm
    gpgcheck: 1
    gpgkey: https://packages.grafana.com/gpg.key
  when: ansible_os_family == "RedHat"

- name: Install Grafana package (RedHat/CentOS)
  yum:
    name: grafana
    state: present
  when: ansible_os_family == "RedHat"
  notify: restart grafana

- name: Configure Grafana datasources
  template:
    src: datasource.yml.j2
    dest: /etc/grafana/provisioning/datasources/default.yml
    owner: grafana
    group: grafana
    mode: 0644
  notify: restart grafana

- name: Enable and start Grafana service
  systemd:
    name: grafana-server
    state: started
    enabled: yes
Step 3: Create Datasource Template
In roles/grafana/templates/datasource.yml.j2:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    editable: false
Step 4: Define Handlers
In roles/grafana/handlers/main.yml:
---
- name: restart grafana
  systemd:
    name: grafana-server
    state: restarted
Step 5: Create Grafana Playbook
---
- name: Deploy Grafana Visualization Platform
  hosts: monitoring_servers
  become: true
  roles:
    - grafana
Run the playbook:
ansible-playbook -i inventory grafana_setup.yml
Automating Alerts with Ansible
Monitoring is incomplete without alerts. Let's configure Prometheus AlertManager:
Step 1: Create AlertManager Role
ansible-galaxy init roles/alertmanager
Step 2: Define AlertManager Tasks
In roles/alertmanager/tasks/main.yml:
---
- name: Create AlertManager system group
  group:
    name: alertmanager
    state: present
    system: true

- name: Create AlertManager system user
  user:
    name: alertmanager
    group: alertmanager
    system: true
    shell: /sbin/nologin
    create_home: false

- name: Create AlertManager directories
  file:
    path: "{{ item }}"
    state: directory
    owner: alertmanager
    group: alertmanager
    mode: 0755
  with_items:
    - /etc/alertmanager
    - /var/lib/alertmanager

- name: Download and extract AlertManager binary
  unarchive:
    src: "https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz"
    dest: /tmp
    remote_src: yes
    creates: "/tmp/alertmanager-0.24.0.linux-amd64"

- name: Copy AlertManager binaries
  copy:
    src: "/tmp/alertmanager-0.24.0.linux-amd64/{{ item }}"
    dest: "/usr/local/bin/{{ item }}"
    remote_src: yes
    owner: alertmanager
    group: alertmanager
    mode: 0755
  with_items:
    - alertmanager
    - amtool

- name: Create AlertManager configuration
  template:
    src: alertmanager.yml.j2
    dest: /etc/alertmanager/alertmanager.yml
    owner: alertmanager
    group: alertmanager
    mode: 0644
  notify: restart alertmanager

- name: Create AlertManager systemd service
  template:
    src: alertmanager.service.j2
    dest: /etc/systemd/system/alertmanager.service
    owner: root
    group: root
    mode: 0644
  notify: restart alertmanager

- name: Enable and start AlertManager service
  systemd:
    name: alertmanager
    state: started
    enabled: yes
    daemon_reload: yes
Step 3: Create Service and Configuration Templates
In roles/alertmanager/templates/alertmanager.service.j2:
[Unit]
Description=AlertManager
Documentation=https://github.com/prometheus/alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager
Restart=always

[Install]
WantedBy=multi-user.target
In roles/alertmanager/templates/alertmanager.yml.j2:
global:
  resolve_timeout: 5m
  slack_api_url: '{{ alertmanager_slack_webhook_url | default("") }}'

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email-notifications'

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: '{{ alertmanager_email | default("[email protected]") }}'
        from: '{{ alertmanager_email_from | default("[email protected]") }}'
        smarthost: '{{ alertmanager_email_smarthost | default("smtp.example.com:587") }}'
        auth_username: '{{ alertmanager_email_username | default("") }}'
        auth_password: '{{ alertmanager_email_password | default("") }}'
        require_tls: {{ alertmanager_email_require_tls | default(true) }}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
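Because a typo in this file prevents Alertmanager from starting, an optional refinement is to let the configuration task from Step 2 validate the rendered file before it replaces the live copy, using the template module's validate parameter and the amtool binary installed alongside Alertmanager. A sketch of that variant:

- name: Create AlertManager configuration
  template:
    src: alertmanager.yml.j2
    dest: /etc/alertmanager/alertmanager.yml
    owner: alertmanager
    group: alertmanager
    mode: 0644
    # reject the new file if amtool cannot parse it
    validate: /usr/local/bin/amtool check-config %s
  notify: restart alertmanager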
Step 4: Define Handlers
In roles/alertmanager/handlers/main.yml:
---
- name: restart alertmanager
  systemd:
    name: alertmanager
    state: restarted
Step 5: Update Prometheus for AlertManager
Now let's modify our Prometheus configuration to work with AlertManager. Add the following to roles/prometheus/templates/prometheus.yml.j2:
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - "/etc/prometheus/rules/*.yml"
Step 6: Create Alert Rules
First, add a task to the Prometheus role (roles/prometheus/tasks/main.yml) to create a directory for rules:
- name: Create rules directory
  file:
    path: /etc/prometheus/rules
    state: directory
    owner: prometheus
    group: prometheus
    mode: 0755
Then create a template for rules in roles/prometheus/templates/node_alerts.yml.j2. Note the {% raw %} block: it stops Ansible's Jinja2 templating from trying to evaluate Prometheus's own {{ $value }} and {{ $labels }} expressions when the file is rendered:
{% raw %}
groups:
  - name: node_alerts
    rules:
      - alert: HighCPULoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load (instance {{ $labels.instance }})"
          description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: HighMemoryLoad
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory load (instance {{ $labels.instance }})"
          description: "Memory load is > 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage (instance {{ $labels.instance }})"
          description: "Disk usage is > 85%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
{% endraw %}
Add a task to copy this file:
- name: Copy alert rules
  template:
    src: node_alerts.yml.j2
    dest: /etc/prometheus/rules/node_alerts.yml
    owner: prometheus
    group: prometheus
    mode: 0644
  notify: restart prometheus
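As with the Alertmanager configuration, the rules file can be checked before it lands on disk. This variant assumes promtool was copied to /usr/local/bin by the Prometheus role in Step 4:

- name: Copy alert rules
  template:
    src: node_alerts.yml.j2
    dest: /etc/prometheus/rules/node_alerts.yml
    owner: prometheus
    group: prometheus
    mode: 0644
    # fail the task instead of shipping a broken rules file
    validate: /usr/local/bin/promtool check rules %s
  notify: restart prometheus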
Step 7: Create Complete Monitoring Playbook
Let's create a comprehensive playbook that sets up the entire monitoring stack:
---
- name: Deploy Complete Monitoring Solution
  hosts: monitoring_servers
  become: true
  vars:
    alertmanager_email: "[email protected]"
    alertmanager_email_from: "[email protected]"
    alertmanager_email_smarthost: "smtp.yourdomain.com:587"
    alertmanager_email_username: "[email protected]"
    alertmanager_email_password: "your-password"
  roles:
    - prometheus
    - alertmanager
    - grafana

- name: Deploy Node Exporter to All Servers
  hosts: all:!monitoring_servers
  become: true
  roles:
    - node_exporter
Run the complete monitoring setup:
ansible-playbook -i inventory monitoring_setup.yml
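Once the run finishes, a short follow-up play can confirm that every component answers on its health endpoint. This is a minimal sketch with the built-in uri module; the endpoints shown (Prometheus and Alertmanager expose /-/healthy, Grafana exposes /api/health) and the ports assume the defaults used throughout this guide:

---
- name: Verify the monitoring stack is healthy
  hosts: monitoring_servers
  tasks:
    - name: Check each component's health endpoint
      ansible.builtin.uri:
        url: "http://localhost:{{ item.port }}{{ item.path }}"
        status_code: 200
      loop:
        - { name: prometheus, port: 9090, path: "/-/healthy" }
        - { name: alertmanager, port: 9093, path: "/-/healthy" }
        - { name: grafana, port: 3000, path: "/api/health" }
      loop_control:
        label: "{{ item.name }}"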
Automating Monitoring Checks with Ansible
Beyond setting up the monitoring infrastructure, Ansible can also perform ad-hoc monitoring checks:
Creating a Monitoring Check Playbook
---
- name: Perform infrastructure health checks
  hosts: all
  become: true
  tasks:
    - name: Check disk space
      command: df -h
      register: df_output
      changed_when: false

    - name: Check memory usage
      command: free -m
      register: free_output
      changed_when: false

    - name: Check load average
      command: uptime
      register: uptime_output
      changed_when: false

    - name: Check for failed services
      command: systemctl --failed
      register: failed_services
      changed_when: false

    - name: Display health check results
      debug:
        msg:
          - "Disk space: {{ df_output.stdout_lines }}"
          - "Memory usage: {{ free_output.stdout_lines }}"
          - "Load average: {{ uptime_output.stdout }}"
          - "Failed services: {{ failed_services.stdout_lines }}"

    - name: Alert on critical disk usage
      fail:
        msg: "Critical disk usage detected on {{ ansible_hostname }}"
      when: df_output.stdout is search('([8-9][0-9]|100)%')
This playbook can be run on-demand or scheduled via cron to perform health checks.
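If you prefer to manage the schedule with Ansible as well, a small sketch like the one below installs a crontab entry on the control node. The /opt/ansible/... paths and the log location are placeholders to adapt to your own layout; the playbook name matches ad_hoc_checks.yml from the project structure at the end of this guide:

- name: Schedule recurring health checks from the control node
  hosts: localhost
  connection: local
  tasks:
    - name: Run ad_hoc_checks.yml at the top of every hour
      ansible.builtin.cron:
        name: "ansible infrastructure health checks"
        minute: "0"
        job: "ansible-playbook -i /opt/ansible/inventory /opt/ansible/playbooks/ad_hoc_checks.yml >> /var/log/ansible-healthcheck.log 2>&1"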
Visualizing Your Monitoring Architecture
Let's visualize our complete monitoring architecture:
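A simplified text view of how the pieces fit together (ports are the defaults used throughout this guide):

web01 / web02 / db01  (node_exporter :9100)
        |
        | scraped every 15s
        v
Prometheus on monitor01 (:9090)  <--- queried by Grafana (:3000)
        |
        | evaluates alert rules, fires alerts
        v
Alertmanager (:9093)
        |
        v
email / Slack notifications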
Dynamic Inventory for Monitoring
For large, dynamic environments, you can use Ansible's dynamic inventory to automatically discover and monitor new servers:
---
- name: Update Prometheus targets from dynamic inventory
  hosts: monitoring_servers
  become: true
  tasks:
    - name: Get all hosts from inventory
      set_fact:
        all_hosts: "{{ groups['all'] | difference(groups['monitoring_servers']) }}"

    - name: Template Prometheus configuration with dynamic targets
      template:
        src: prometheus_dynamic.yml.j2
        dest: /etc/prometheus/prometheus.yml
        owner: prometheus
        group: prometheus
        mode: 0644
      vars:
        node_targets: "{{ all_hosts | map('regex_replace', '^(.*)$', '\\1:9100') | list }}"
      notify: restart prometheus

  handlers:
    - name: restart prometheus
      systemd:
        name: prometheus
        state: restarted
With a template like:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "/etc/prometheus/rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    scrape_interval: 15s
    static_configs:
      - targets: {{ node_targets | to_json }}
Best Practices for Monitoring with Ansible
When implementing infrastructure monitoring with Ansible, follow these best practices:
- Use roles for reusability: Package your monitoring components in roles for easier reuse and sharing.
- Separate configuration from code: Use variables and templates to separate configuration from implementation.
- Use Ansible Vault for sensitive data: Store credentials and sensitive information in Ansible Vault (see the sketch after this list):
  ansible-vault create monitoring_secrets.yml
- Implement tags for selective execution:
  tasks:
    - name: Install Prometheus
      # task details
      tags: [prometheus, install]
  Then run with --tags:
  ansible-playbook -i inventory monitoring.yml --tags "prometheus"
- Implement monitoring for Ansible itself: Track the success and failure of Ansible runs using callback plugins (see the ansible.cfg example after this list).
- Version your monitoring configuration: Keep your monitoring setup in git or another version control system.
- Test monitoring in staging first: Always test your monitoring setup in a staging environment before production.
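For example, the SMTP credentials hard-coded in the complete monitoring playbook earlier could live in a vaulted file instead. A minimal sketch, assuming a monitoring_secrets.yml created with ansible-vault holds the alertmanager_email_* variables:

---
- name: Deploy Complete Monitoring Solution
  hosts: monitoring_servers
  become: true
  vars_files:
    # created with: ansible-vault create monitoring_secrets.yml
    # contains alertmanager_email, alertmanager_email_password, etc.
    - monitoring_secrets.yml
  roles:
    - prometheus
    - alertmanager
    - grafana

The playbook is then run with --ask-vault-pass (or --vault-password-file) so Ansible can decrypt the variables at runtime.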
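For monitoring Ansible itself, the lightest option is a persistent run log plus a timing callback in ansible.cfg. This is only a sketch: callbacks_enabled is the setting name on recent Ansible releases (older ones use callback_whitelist), and the profile_tasks callback assumes the ansible.posix collection is installed.

# ansible.cfg on the control node
[defaults]
# keep a persistent record of every run
log_path = /var/log/ansible.log
# print per-task timing after each play
callbacks_enabled = ansible.posix.profile_tasks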
Summary
In this guide, we've explored how to use Ansible to automate infrastructure monitoring:
- We deployed Prometheus as a central metrics collection system
- We installed Node Exporter on target servers to collect system metrics
- We set up Grafana to create beautiful dashboards and visualizations
- We configured AlertManager to notify operators of problems
- We created alert rules to detect common system issues
- We built ad-hoc monitoring checks using Ansible playbooks
- We implemented dynamic inventory integration for auto-discovery
By automating your monitoring with Ansible, you can ensure consistent implementation across your infrastructure, rapidly deploy monitoring to new systems, and maintain a reliable observability platform that grows with your environment.
Additional Resources and Exercises
Exercises
- Basic: Extend the Node Exporter role to collect additional metrics specific to web servers or database servers.
- Intermediate: Create custom Grafana dashboards via Ansible for different server types (web, database, application).
- Advanced: Implement a role that uses the Ansible Tower/AWX API to create monitoring jobs that automatically remediate common issues.
- Challenge: Create a complete CI/CD pipeline that tests monitoring configuration before deploying it to production.
Sample Ansible Project Structure
For a complete monitoring solution, consider this project structure:
ansible-monitoring/
├── inventory/
│   ├── production
│   └── staging
├── group_vars/
│   ├── all.yml
│   ├── monitoring_servers.yml
│   └── web_servers.yml
├── host_vars/
│   └── monitor01.example.com.yml
├── roles/
│   ├── prometheus/
│   ├── node_exporter/
│   ├── alertmanager/
│   └── grafana/
├── playbooks/
│   ├── monitoring_setup.yml
│   ├── ad_hoc_checks.yml
│   └── remediation.yml
└── ansible.cfg
By following this guide, you now have the skills to implement comprehensive infrastructure monitoring using Ansible automation!