Ubuntu Cloud Monitoring
Introduction
Monitoring is a critical aspect of maintaining healthy and efficient cloud environments. For Ubuntu-based cloud deployments, proper monitoring ensures optimal performance, enhances security, and helps prevent downtime. This guide explores the fundamentals of Ubuntu cloud monitoring, essential tools, implementation strategies, and best practices to help you build a robust monitoring system for your cloud infrastructure.
Why Monitor Your Ubuntu Cloud Environment?
Cloud monitoring provides visibility into your infrastructure and applications, allowing you to:
- Detect and resolve issues before they impact users
- Optimize resource utilization and reduce costs
- Ensure compliance with service level agreements (SLAs)
- Identify security threats and vulnerabilities
- Make data-driven decisions about scaling and improvements
Key Monitoring Components for Ubuntu Cloud
System-Level Metrics
At the foundation of cloud monitoring are system-level metrics that provide insight into the health and performance of your Ubuntu instances:
Metric Type | Examples | Importance |
---|---|---|
CPU | Usage percentage, load average, context switches | Indicates processing capacity and potential bottlenecks |
Memory | Free/used RAM, swap usage, page faults | Helps identify memory leaks and capacity issues |
Disk | I/O operations, free space, read/write latency | Prevents storage-related failures |
Network | Bandwidth usage, packet loss, connection count | Monitors connectivity and data transfer efficiency |
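As a rough illustration of where two of these numbers come from, the sketch below derives memory usage and per-core load from the kernel's /proc interface. The helper names mem_used_pct and load_per_core are made up for this example, and the approach assumes a Linux host; it is not a description of how any particular monitoring agent is implemented.

```shell
# mem_used_pct: read /proc/meminfo-formatted text on stdin and print the
# percentage of RAM in use, computed as (MemTotal - MemAvailable) / MemTotal.
mem_used_pct() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END {
      if (total > 0) { printf "%.1f\n", (total - avail) * 100 / total }
      else           { exit 1 }
    }'
}

# load_per_core: divide a load average by the number of CPU cores.
# Values that stay above 1.00 suggest work is queuing for the CPU.
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# On a live Linux host:
#   mem_used_pct < /proc/meminfo
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

Both helpers only read data, so they are safe to run on production instances.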
Application-Level Monitoring
Beyond system metrics, monitoring the applications running on your Ubuntu cloud instances is essential:
- Response times: How long your applications take to process requests
- Error rates: Frequency and types of errors occurring in your applications
- Request rates: Volume of traffic your applications are handling
- Saturation: How close your application service is to its capacity limits (for example, queue depth or connection pool usage)
Log Management
Logs provide detailed information about events occurring within your Ubuntu cloud environment:
- System logs (/var/log/syslog, /var/log/auth.log, etc.)
- Application logs
- Service-specific logs (web servers, databases, etc.)
- Cloud provider logs
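To give a sense of what system logs can reveal, here is a small sketch that counts failed SSH password attempts per source address. It assumes the usual OpenSSH message wording ("Failed password for ... from ADDRESS port ..."); failed_ssh_by_ip is a hypothetical helper name.

```shell
# failed_ssh_by_ip: read auth.log lines on stdin and print
# "<source address> <failed attempts>", most active source first.
failed_ssh_by_ip() {
  grep 'Failed password' |
    sed -E 's/.* from ([^ ]+) port .*/\1/' |   # keep only the source address
    sort | uniq -c | sort -rn |                # count and rank addresses
    awk '{ print $2, $1 }'
}

# On a live host:
#   sudo cat /var/log/auth.log | failed_ssh_by_ip | head -5
```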
Essential Ubuntu Cloud Monitoring Tools
1. Prometheus
Prometheus is a powerful open-source monitoring system with a dimensional data model and flexible query language.
Installation on Ubuntu
# Update package information
sudo apt update
# Install dependencies
sudo apt install -y apt-transport-https software-properties-common
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.42.0/prometheus-2.42.0.linux-amd64.tar.gz
# Extract the archive
tar xvf prometheus-2.42.0.linux-amd64.tar.gz
# Move to /opt directory
sudo mv prometheus-2.42.0.linux-amd64 /opt/prometheus
# Create a Prometheus user
sudo useradd -rs /bin/false prometheus
# Create directories for configuration and data
sudo mkdir -p /etc/prometheus /var/lib/prometheus
# Set ownership
sudo chown -R prometheus:prometheus /opt/prometheus /etc/prometheus /var/lib/prometheus
Basic Configuration
Create a configuration file at /etc/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
Create a systemd service
sudo tee /etc/systemd/system/prometheus.service > /dev/null << EOF
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--web.console.templates=/opt/prometheus/consoles \
--web.console.libraries=/opt/prometheus/console_libraries
[Install]
WantedBy=multi-user.target
EOF
Start and enable the service:
sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus
2. Node Exporter
Node Exporter collects system-level metrics from Ubuntu hosts.
# Download Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
# Extract the archive
tar xvf node_exporter-1.5.0.linux-amd64.tar.gz
# Move the binary
sudo mv node_exporter-1.5.0.linux-amd64/node_exporter /usr/local/bin/
# Create a Node Exporter user
sudo useradd -rs /bin/false node_exporter
# Create a systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
# Start and enable the service
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
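Node Exporter serves plain-text metrics on port 9100, so a quick sanity check needs nothing more than curl and awk. The sketch below pulls a single unlabelled metric out of that text format; metric_value is a hypothetical helper, and node_load1 (the 1-minute load average) is used only as an example metric.

```shell
# metric_value: print the value of an unlabelled metric from Prometheus
# text-format output read on stdin; exits non-zero if the metric is absent.
metric_value() {
  awk -v name="$1" '
    $1 == name { print $2; found = 1; exit }   # comment lines start with "#"
    END        { exit !found }'
}

# On a host running Node Exporter:
#   curl -s http://localhost:9100/metrics | metric_value node_load1
```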
3. Grafana
Grafana provides visualization for your monitoring data with dashboards and alerts.
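# Add the Grafana APT repository and its signing key (apt-key is deprecated)
sudo apt-get install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list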
# Update packages and install Grafana
sudo apt-get update
sudo apt-get install -y grafana
# Start and enable Grafana service
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
After installation, access Grafana at http://your-server-ip:3000 (default credentials: admin/admin; Grafana prompts you to change the password on first login).
4. Netdata
Netdata provides real-time monitoring with minimal configuration:
# Install dependencies
sudo apt-get install -y curl
# Download and run the installation script
bash <(curl -Ss https://my-netdata.io/kickstart.sh)
Access the Netdata dashboard at http://your-server-ip:19999.
Setting Up a Complete Monitoring Stack
The following example demonstrates how to set up a monitoring stack for multiple Ubuntu cloud instances:
Step 1: Install Monitoring Agents
Deploy Node Exporter to all Ubuntu instances:
#!/bin/bash
# deploy-node-exporter.sh
# Install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
tar xvf node_exporter-1.5.0.linux-amd64.tar.gz
sudo mv node_exporter-1.5.0.linux-amd64/node_exporter /usr/local/bin/
sudo useradd -rs /bin/false node_exporter
# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
# Start and enable service
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
# Open firewall port for Prometheus scraping
sudo ufw allow from prometheus-server-ip to any port 9100
Step 2: Configure Prometheus
Update the Prometheus configuration to scrape metrics from all instances:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "alert_rules.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'ubuntu_nodes'
    static_configs:
      - targets:
          - 'instance-1:9100'
          - 'instance-2:9100'
          - 'instance-3:9100'
          # Add more instances as needed
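Editing the target list by hand gets tedious as the fleet grows. One option, sketched here, is to generate a target file from a plain list of hostnames and let Prometheus read it through its file-based service discovery (file_sd_configs). The function name gen_file_sd, the hosts.txt input, and the output path are assumptions for illustration.

```shell
# gen_file_sd: read one hostname per line on stdin (blank lines and lines
# starting with "#" are ignored) and print a Prometheus file_sd JSON
# document that targets the given port (default 9100) on each host.
gen_file_sd() {
  local port="${1:-9100}"
  awk -v port="$port" '
    NF && $1 !~ /^#/ { targets[++n] = "\"" $1 ":" port "\"" }
    END {
      printf "[{\"targets\": ["
      for (i = 1; i <= n; i++) printf "%s%s", targets[i], (i < n ? ", " : "")
      printf "]}]\n"
    }'
}

# Example (hosts.txt and the output path are hypothetical):
#   gen_file_sd < hosts.txt | sudo tee /etc/prometheus/targets/nodes.json
```

To use the generated file, the ubuntu_nodes job would need a file_sd_configs entry whose files list points at it in place of static_configs. Prometheus re-reads such files when they change, so adding an instance does not require a restart.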
Step 3: Configure Alerting
Create an alert rules file at /etc/prometheus/alert_rules.yml:
groups:
  - name: ubuntu_cloud_alerts
    rules:
      - alert: HighCPULoad
        # rate() is preferred over irate() for alerting; irate() only uses the last two samples
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load (instance {{ $labels.instance }})"
          description: "CPU load is > 80% for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage (instance {{ $labels.instance }})"
          description: "Memory usage is > 85% for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
      - alert: LowDiskSpace
        # avail_bytes excludes blocks reserved for root, so it reflects space services can actually use
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space (instance {{ $labels.instance }})"
          description: "Disk space is < 10% for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
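A syntax error in prometheus.yml or alert_rules.yml can keep Prometheus from loading its configuration, so it is worth validating both before a restart. The Prometheus release archive ships a promtool binary with check config and check rules subcommands; the wrapper below is a sketch that assumes the archive was unpacked to /opt/prometheus as in the installation steps. The function name promtool_check and the PROMTOOL override variable are hypothetical.

```shell
# promtool_check: validate a Prometheus config or rules file with promtool.
# Usage: promtool_check config|rules <file>
promtool_check() {
  local kind="$1" file="$2"
  local promtool="${PROMTOOL:-/opt/prometheus/promtool}"   # assumed install path
  case "$kind" in
    config|rules) ;;
    *) echo "usage: promtool_check config|rules <file>" >&2; return 2 ;;
  esac
  if [ ! -x "$promtool" ]; then
    echo "promtool not found at $promtool" >&2
    return 127
  fi
  "$promtool" check "$kind" "$file"
}

# Typical use after editing:
#   promtool_check rules  /etc/prometheus/alert_rules.yml
#   promtool_check config /etc/prometheus/prometheus.yml
```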
Step 4: Set Up Alertmanager
Install and configure Alertmanager:
# Download Alert Manager
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
# Extract the archive
tar xvf alertmanager-0.25.0.linux-amd64.tar.gz
# Move to /opt directory
sudo mv alertmanager-0.25.0.linux-amd64 /opt/alertmanager
# Create user and set permissions
sudo useradd -rs /bin/false alertmanager
sudo chown -R alertmanager:alertmanager /opt/alertmanager
Create a configuration file at /opt/alertmanager/alertmanager.yml:
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'  # placeholder sender address
  smtp_auth_username: 'your-username'
  smtp_auth_password: 'your-password'
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email-notifications'
receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'admin@example.com'  # placeholder recipient address
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
Create a systemd service for Alertmanager:
sudo tee /etc/systemd/system/alertmanager.service > /dev/null << EOF
[Unit]
Description=Alert Manager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/opt/alertmanager/alertmanager \
--config.file=/opt/alertmanager/alertmanager.yml \
--storage.path=/opt/alertmanager/data
[Install]
WantedBy=multi-user.target
EOF
Start and enable the Alertmanager service:
sudo systemctl daemon-reload
sudo systemctl start alertmanager
sudo systemctl enable alertmanager
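Once the service is running, a hand-made alert is a simple way to confirm that routing and email delivery work before a real incident tests them. The sketch below builds a minimal payload for the v2 HTTP API (POST /api/v2/alerts); test_alert_json and the label values are illustrative assumptions.

```shell
# test_alert_json: print a one-alert JSON payload for POST /api/v2/alerts.
# Arguments (all optional): alert name, severity, instance label.
test_alert_json() {
  local name="${1:-TestAlert}" severity="${2:-warning}" instance="${3:-test-instance}"
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s"},"annotations":{"summary":"Manual test alert"}}]\n' \
    "$name" "$severity" "$instance"
}

# Send it to a local instance listening on the default port:
#   curl -s -X POST -H 'Content-Type: application/json' \
#     -d "$(test_alert_json)" http://localhost:9093/api/v2/alerts
```

With the route shown earlier, the notification should arrive after the 30-second group_wait.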
Step 5: Configure Grafana Dashboards
After setting up Grafana, add Prometheus as a data source:
- Navigate to Configuration > Data Sources
- Click "Add data source"
- Select "Prometheus"
- Set the URL to http://localhost:9090 (or your Prometheus server address)
- Click "Save & Test"
Import pre-built dashboards for Ubuntu monitoring:
- Go to Create > Import
- Enter dashboard ID 1860 (Node Exporter Full)
- Select your Prometheus data source
- Click "Import"
Monitoring Ubuntu Cloud Resources with Cloud-Native Tools
When your Ubuntu instances are running in public cloud environments, leverage cloud-native monitoring tools:
AWS CloudWatch Integration
For Ubuntu instances on AWS, install the CloudWatch agent:
# Install CloudWatch agent
wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb
# Configure the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard
Sample configuration file for /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json:
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "metrics_collected": {
      "cpu": {
        "measurement": ["cpu_usage_idle", "cpu_usage_user", "cpu_usage_system"],
        "resources": ["*"],
        "totalcpu": true
      },
      "disk": {
        "measurement": ["used_percent", "inodes_free"],
        "resources": ["/"]
      },
      "diskio": {
        "measurement": ["io_time", "write_bytes", "read_bytes", "writes", "reads"],
        "resources": ["*"]
      },
      "mem": {
        "measurement": ["mem_used_percent"]
      },
      "netstat": {
        "measurement": ["tcp_established", "tcp_time_wait"]
      },
      "swap": {
        "measurement": ["swap_used_percent"]
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/syslog",
            "log_group_name": "syslog",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/auth.log",
            "log_group_name": "auth.log",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
Start the CloudWatch agent:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
Azure Monitor Integration
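For Ubuntu instances on Azure, install the Log Analytics agent. Microsoft retired this agent in August 2024 in favor of the Azure Monitor Agent, so the commands below apply to existing legacy deployments: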
# Download and install the Log Analytics agent
wget https://raw.githubusercontent.com/Microsoft/OMS-Agent-for-Linux/master/installer/scripts/onboard_agent.sh
chmod +x onboard_agent.sh
sudo ./onboard_agent.sh -w <YOUR_WORKSPACE_ID> -s <YOUR_WORKSPACE_KEY>
Google Cloud Monitoring
For Ubuntu on Google Cloud, install the Ops Agent:
# Add the Cloud Monitoring agent repository
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install
Best Practices for Ubuntu Cloud Monitoring
1. Define Clear Monitoring Objectives
Before implementing monitoring, define what you need to monitor:
- Identify critical services and applications
- Determine key performance indicators (KPIs)
- Set thresholds for alerts based on historical data
- Document monitoring requirements
2. Implement the USE Method
The USE method helps identify performance issues by focusing on:
- Utilization: Percentage of time the resource is busy
- Saturation: Amount of work a resource has queued
- Errors: Count of error events
Apply this method to CPU, memory, disk, and network resources.
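For CPU, utilization can be derived from the cumulative counters in /proc/stat by sampling them twice and comparing. The following sketch shows that calculation; cpu_util_pct is a hypothetical helper, and counting iowait as idle time is a simplification.

```shell
# cpu_util_pct: given two aggregate "cpu ..." lines from /proc/stat taken a
# few seconds apart, print the percentage of CPU time spent non-idle between
# them. Fields after "cpu": user nice system idle iowait irq softirq steal ...
cpu_util_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= 9 && i <= NF; i++) total += $i   # user through steal
      idle = $5 + $6                                     # idle + iowait
      if (NR == 1) { t0 = total; i0 = idle } else { t1 = total; i1 = idle }
    }
    END {
      dt = t1 - t0
      if (dt <= 0) { print "0.0"; exit }
      printf "%.1f\n", (dt - (i1 - i0)) * 100 / dt
    }'
}

# On a live Linux host:
#   a=$(head -n1 /proc/stat); sleep 5; b=$(head -n1 /proc/stat)
#   cpu_util_pct "$a" "$b"
```

Saturation for the same resource is commonly approximated by comparing the load average with the number of cores.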
3. Implement the RED Method
For monitoring services, follow the RED method:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Distribution of request latencies
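The three RED numbers can be computed from almost any request log. The sketch below assumes a deliberately simple, made-up log format with one request per line (epoch seconds, HTTP status, duration in seconds) and uses the nearest-rank method for the 95th percentile; red_summary is a hypothetical helper, not part of any tool mentioned in this guide.

```shell
# red_summary: read lines of the form
#   <epoch_seconds> <http_status> <duration_seconds>
# and print the request rate, the 5xx error percentage, and p95 latency.
red_summary() {
  LC_ALL=C sort -n -k3,3 | awk '
    {
      n++
      if (n == 1 || $1 < first) first = $1
      if (n == 1 || $1 > last)  last  = $1
      if ($2 >= 500) errors++
      dur[n] = $3                           # durations arrive sorted ascending
    }
    END {
      if (n == 0) { print "no requests"; exit }
      window = last - first + 1             # seconds covered, inclusive
      p95 = dur[int((n * 95 + 99) / 100)]   # nearest rank: ceil(0.95 * n)
      printf "rate=%.2f/s errors=%.1f%% p95=%.3fs\n", n / window, errors * 100 / n, p95
    }'
}

# Example: red_summary < requests.log   (requests.log is a hypothetical file)
```

In a Prometheus setup the same three figures would normally come from counters and histograms exposed by the application rather than from log parsing.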
4. Establish a Monitoring Hierarchy
Create a hierarchical monitoring structure that moves from system-level metrics up through application-level metrics to business metrics.
5. Implement Alert Fatigue Prevention
To prevent alert fatigue:
- Define alert severity levels (critical, warning, info)
- Implement alert throttling and grouping
- Use progressive alerting (escalate alerts over time)
- Regularly review and tune alert thresholds
6. Implement Log Rotation
Configure log rotation to prevent disk space issues:
sudo tee /etc/logrotate.d/custom-logs > /dev/null << EOF
/var/log/application/*.log {
daily
missingok
rotate 14
compress
delaycompress
notifempty
create 0640 www-data www-data
sharedscripts
postrotate
systemctl reload application.service > /dev/null 2>/dev/null || true
endscript
}
EOF
7. Implement Centralized Logging
Use tools like Elasticsearch, Logstash, and Kibana (ELK stack) for centralized logging:
# Install Filebeat for log shipping
curl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-8.5.3-amd64.deb
sudo dpkg -i filebeat-8.5.3-amd64.deb
# Configure Filebeat
sudo tee /etc/filebeat/filebeat.yml > /dev/null << EOF
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/*.log
      - /var/log/syslog
      - /var/log/auth.log
output.elasticsearch:
  hosts: ["elasticsearch-host:9200"]
setup.kibana:
  host: "kibana-host:5601"
EOF
# Start and enable Filebeat
sudo systemctl enable filebeat
sudo systemctl start filebeat
Practical Example: Complete Monitoring Setup for a Web Application
Let's implement a comprehensive monitoring solution for a web application running on Ubuntu:
Step 1: System and Service Monitoring
Install Node Exporter on all web servers and database servers following the instructions provided earlier.
Step 2: Application-Specific Monitoring
For a Node.js application, use the prom-client library:
// app.js - Node.js application with Prometheus monitoring
const express = require('express');
const client = require('prom-client');
const app = express();
// Create a Registry to register metrics
const register = new client.Registry();
// Enable default metrics collection
const collectDefaultMetrics = client.collectDefaultMetrics;
collectDefaultMetrics({ register });
// Create a custom histogram (startTimer() reports durations in seconds)
const httpRequestDurationSeconds = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});
// Register the custom metric
register.registerMetric(httpRequestDurationSeconds);
// Middleware to measure request duration
app.use((req, res, next) => {
  const end = httpRequestDurationSeconds.startTimer();
  res.on('finish', () => {
    // Prefer the matched route pattern to keep label cardinality low
    end({ method: req.method, route: req.route?.path || req.path, status_code: res.statusCode });
  });
  next();
});
// Endpoint for Prometheus to scrape metrics
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
// Application routes
app.get('/', (req, res) => {
  res.send('Hello World!');
});
app.listen(3000, () => {
  console.log('Server running on port 3000');
});
Step 3: Database Monitoring
Install the PostgreSQL Exporter for database monitoring:
# Download PostgreSQL Exporter
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.11.1/postgres_exporter-0.11.1.linux-amd64.tar.gz
# Extract the archive
tar xvf postgres_exporter-0.11.1.linux-amd64.tar.gz
# Move binary to /usr/local/bin
sudo mv postgres_exporter-0.11.1.linux-amd64/postgres_exporter /usr/local/bin/
# Create dedicated PostgreSQL user for monitoring
sudo -u postgres psql -c "CREATE USER postgres_exporter WITH PASSWORD 'password';"
sudo -u postgres psql -c "GRANT pg_monitor TO postgres_exporter;"
# Create systemd service
sudo tee /etc/systemd/system/postgres_exporter.service > /dev/null << EOF
[Unit]
Description=PostgreSQL Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=postgres
Group=postgres
Type=simple
Environment="DATA_SOURCE_NAME=postgresql://postgres_exporter:password@localhost:5432/postgres?sslmode=disable"
ExecStart=/usr/local/bin/postgres_exporter
[Install]
WantedBy=multi-user.target
EOF
# Start and enable service
sudo systemctl daemon-reload
sudo systemctl start postgres_exporter
sudo systemctl enable postgres_exporter
Step 4: Update Prometheus Configuration
Update Prometheus configuration to scrape metrics from all components:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - "alert_rules.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'web_servers'
    static_configs:
      - targets:
          - 'web-server-1:9100' # Node Exporter
          - 'web-server-2:9100'
  - job_name: 'db_servers'
    static_configs:
      - targets:
          - 'db-server:9100' # Node Exporter
          - 'db-server:9187' # PostgreSQL Exporter
  - job_name: 'nodejs_app'
    static_configs:
      - targets:
          - 'web-server-1:3000' # Node.js app metrics
          - 'web-server-2:3000'
Step 5: Create Grafana Dashboards
Create custom dashboards for different aspects of your application:
- System Overview Dashboard
- Web Application Performance Dashboard
- Database Performance Dashboard
- Business Metrics Dashboard
Troubleshooting Common Monitoring Issues
Problem: High CPU Usage on Monitored Nodes
Solution:
- Identify the process consuming CPU (omit htop's -u flag so that processes of all users are shown):
sudo apt install htop
htop --sort-key=PERCENT_CPU
- Check if it's related to the monitoring agent:
ps aux | grep node_exporter
- Adjust the scrape interval in Prometheus if needed.
Problem: Missing Data in Grafana
Solution:
- Check if Prometheus is scraping targets:
curl http://localhost:9090/api/v1/targets
- Verify firewall rules:
sudo ufw status
- Check for network connectivity issues:
telnet target-host 9100
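Where telnet is not installed, Bash's /dev/tcp pseudo-device offers a rough equivalent for checking whether a port accepts connections. This is a sketch: port_open is a hypothetical helper, it relies on Bash and the coreutils timeout command, and a successful connection only shows that something is listening, not that it is Node Exporter.

```shell
# port_open: succeed if a TCP connection to host:port can be opened
# within 3 seconds, fail otherwise.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Example, run from the Prometheus server:
#   if port_open target-host 9100; then echo "reachable"; else echo "blocked or down"; fi
```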
Problem: Disk Space Filling Up with Logs
Solution:
- Identify large log files:
sudo du -h /var/log/ | sort -hr | head -10
- Configure log rotation as shown earlier
- Consider using logrotate's maxsize directive:
maxsize 100M
Summary
Effective monitoring is essential for maintaining reliable and efficient Ubuntu cloud environments. By implementing the tools and practices covered in this guide, you can:
- Gain visibility into system and application performance
- Detect and resolve issues before they affect users
- Optimize resource utilization and reduce costs
- Ensure security and compliance
Remember that monitoring is not a one-time setup but an ongoing process that requires regular review and adjustment as your infrastructure evolves.
Exercises
- Set up a basic Prometheus and Node Exporter monitoring system on an Ubuntu VM.
- Create a custom Grafana dashboard showing system metrics.
- Configure alerts for high CPU, memory, and disk usage.
- Implement application-level monitoring for a web application of your choice.
- Set up centralized logging with the ELK stack.