Ubuntu Cloud Monitoring

Introduction

Monitoring is a critical aspect of maintaining healthy and efficient cloud environments. For Ubuntu-based cloud deployments, proper monitoring ensures optimal performance, enhances security, and helps prevent downtime. This guide explores the fundamentals of Ubuntu cloud monitoring, essential tools, implementation strategies, and best practices to help you build a robust monitoring system for your cloud infrastructure.

Why Monitor Your Ubuntu Cloud Environment?

Cloud monitoring provides visibility into your infrastructure and applications, allowing you to:

  • Detect and resolve issues before they impact users
  • Optimize resource utilization and reduce costs
  • Ensure compliance with service level agreements (SLAs)
  • Identify security threats and vulnerabilities
  • Make data-driven decisions about scaling and improvements

Key Monitoring Components for Ubuntu Cloud

System-Level Metrics

At the foundation of cloud monitoring are system-level metrics that provide insight into the health and performance of your Ubuntu instances:

  • CPU: usage percentage, load average, context switches. Indicates processing capacity and potential bottlenecks.
  • Memory: free/used RAM, swap usage, page faults. Helps identify memory leaks and capacity issues.
  • Disk: I/O operations, free space, read/write latency. Prevents storage-related failures.
  • Network: bandwidth usage, packet loss, connection count. Monitors connectivity and data transfer efficiency.
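
A few standard commands are enough to spot-check these metrics directly on an Ubuntu host before any monitoring agents are in place (all of these ship with a default install):

bash
# Quick spot checks of system-level metrics on an Ubuntu host
uptime           # load averages
free -h          # RAM and swap usage
df -h /          # free space on the root filesystem
vmstat 1 5       # CPU, memory, and context-switch activity sampled for 5 seconds
ip -s link       # per-interface packet and error counters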

Application-Level Monitoring

Beyond system metrics, monitoring the applications running on your Ubuntu cloud instances is essential:

  • Response times: How long your applications take to process requests
  • Error rates: Frequency and types of errors occurring in your applications
  • Request rates: Volume of traffic your applications are handling
  • Saturation: How "full" your application service is

Log Management

Logs provide detailed information about events occurring within your Ubuntu cloud environment:

  • System logs (/var/log/syslog, /var/log/auth.log, etc.)
  • Application logs
  • Service-specific logs (web servers, databases, etc.)
  • Cloud provider logs
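
On a systemd-based Ubuntu system, these logs can be inspected with journalctl and the usual file tools, for example:

bash
# Follow the system journal in real time
sudo journalctl -f

# Show only entries from a specific unit for the last hour (ssh as an example)
sudo journalctl -u ssh.service --since "1 hour ago"

# Tail a classic log file directly
sudo tail -f /var/log/auth.log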

Essential Ubuntu Cloud Monitoring Tools

1. Prometheus

Prometheus is a powerful open-source monitoring system with a dimensional data model and flexible query language.

Installation on Ubuntu

bash
# Update package information
sudo apt update

# Install dependencies
sudo apt install -y apt-transport-https software-properties-common

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.42.0/prometheus-2.42.0.linux-amd64.tar.gz

# Extract the archive
tar xvf prometheus-2.42.0.linux-amd64.tar.gz

# Move to /opt directory
sudo mv prometheus-2.42.0.linux-amd64 /opt/prometheus

# Create a Prometheus user
sudo useradd -rs /bin/false prometheus

# Create directories for configuration and data
sudo mkdir -p /etc/prometheus /var/lib/prometheus

# Set ownership
sudo chown -R prometheus:prometheus /opt/prometheus /etc/prometheus /var/lib/prometheus

Basic Configuration

Create a configuration file at /etc/prometheus/prometheus.yml:

yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
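
The Prometheus archive also ships with promtool, which can validate this file before you start the service:

bash
# Validate the configuration (promtool is included in the extracted archive)
/opt/prometheus/promtool check config /etc/prometheus/prometheus.yml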

Create a systemd service

bash
sudo tee /etc/systemd/system/prometheus.service > /dev/null << EOF
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--web.console.templates=/opt/prometheus/consoles \
--web.console.libraries=/opt/prometheus/console_libraries

[Install]
WantedBy=multi-user.target
EOF

Start and enable the service:

bash
sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus
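
A quick way to confirm Prometheus came up cleanly:

bash
# Check the service and the built-in health endpoint
sudo systemctl status prometheus --no-pager
curl -s http://localhost:9090/-/healthy   # returns HTTP 200 with a short health message when Prometheus is up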

2. Node Exporter

Node Exporter collects system-level metrics from Ubuntu hosts.

bash
# Download Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz

# Extract the archive
tar xvf node_exporter-1.5.0.linux-amd64.tar.gz

# Move the binary
sudo mv node_exporter-1.5.0.linux-amd64/node_exporter /usr/local/bin/

# Create a Node Exporter user
sudo useradd -rs /bin/false node_exporter

# Create a systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Start and enable the service
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
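
Node Exporter listens on port 9100 by default; you can confirm it is serving metrics with:

bash
# A healthy exporter returns plain-text metrics
curl -s http://localhost:9100/metrics | head -n 5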

3. Grafana

Grafana provides visualization for your monitoring data with dashboards and alerts.

bash
# Add Grafana APT repository
sudo apt-get install -y software-properties-common
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"

# Update packages and install Grafana
sudo apt-get update
sudo apt-get install -y grafana

# Start and enable Grafana service
sudo systemctl start grafana-server
sudo systemctl enable grafana-server

After installation, access Grafana at http://your-server-ip:3000 (default credentials: admin/admin).

4. Netdata

Netdata provides real-time monitoring with minimal configuration:

bash
# Install dependencies
sudo apt-get install -y curl

# Download and run the installation script
bash <(curl -Ss https://my-netdata.io/kickstart.sh)

Access Netdata dashboard at http://your-server-ip:19999.

Setting Up a Complete Monitoring Stack

The following example demonstrates how to set up a monitoring stack for multiple Ubuntu cloud instances:

Step 1: Install Monitoring Agents

Deploy Node Exporter to all Ubuntu instances:

bash
#!/bin/bash
# deploy-node-exporter.sh

# Install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
tar xvf node_exporter-1.5.0.linux-amd64.tar.gz
sudo mv node_exporter-1.5.0.linux-amd64/node_exporter /usr/local/bin/
sudo useradd -rs /bin/false node_exporter

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Start and enable service
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

# Open firewall port for Prometheus scraping
sudo ufw allow from prometheus-server-ip to any port 9100
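
One way to roll the script out to several instances is a simple SSH loop; the hostnames below are placeholders for your own inventory:

bash
#!/bin/bash
# Push and run deploy-node-exporter.sh on each instance (hostnames are examples)
for host in instance-1 instance-2 instance-3; do
  scp deploy-node-exporter.sh "ubuntu@${host}:/tmp/"
  ssh "ubuntu@${host}" 'sudo bash /tmp/deploy-node-exporter.sh'
done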

Step 2: Configure Prometheus

Update the Prometheus configuration to scrape metrics from all instances:

yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'ubuntu_nodes'
    static_configs:
      - targets:
          - 'instance-1:9100'
          - 'instance-2:9100'
          - 'instance-3:9100'
          # Add more instances as needed
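
After editing the file, validate it, restart Prometheus, and confirm the new targets are being scraped via the built-in targets API:

bash
/opt/prometheus/promtool check config /etc/prometheus/prometheus.yml
sudo systemctl restart prometheus

# Each target should report "health":"up" once the first scrape succeeds
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'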

Step 3: Configure Alerting

Create an alert rules file at /etc/prometheus/alert_rules.yml:

yaml
groups:
  - name: ubuntu_cloud_alerts
    rules:
      - alert: HighCPULoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load (instance {{ $labels.instance }})"
          description: "CPU load is > 80% for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage (instance {{ $labels.instance }})"
          description: "Memory usage is > 85% for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

      - alert: LowDiskSpace
        expr: node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space (instance {{ $labels.instance }})"
          description: "Free disk space is < 10% for 5 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
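
As with the main configuration, promtool can validate the rule file before a restart:

bash
/opt/prometheus/promtool check rules /etc/prometheus/alert_rules.yml
sudo systemctl restart prometheus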

Step 4: Set Up Alert Manager

Install and configure Alert Manager:

bash
# Download Alert Manager
wget https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz

# Extract the archive
tar xvf alertmanager-0.25.0.linux-amd64.tar.gz

# Move to /opt directory
sudo mv alertmanager-0.25.0.linux-amd64 /opt/alertmanager

# Create user and set permissions
sudo useradd -rs /bin/false alertmanager
sudo chown -R alertmanager:alertmanager /opt/alertmanager

Create a configuration file at /opt/alertmanager/alertmanager.yml:

yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: 'your-username'
  smtp_auth_password: 'your-password'

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email-notifications'

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: '[email protected]'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']

Create a systemd service for Alert Manager:

bash
sudo tee /etc/systemd/system/alertmanager.service > /dev/null << EOF
[Unit]
Description=Alert Manager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/opt/alertmanager/alertmanager \
--config.file=/opt/alertmanager/alertmanager.yml \
--storage.path=/opt/alertmanager/data

[Install]
WantedBy=multi-user.target
EOF

Start and enable the Alert Manager service:

bash
sudo systemctl daemon-reload
sudo systemctl start alertmanager
sudo systemctl enable alertmanager
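
The Alert Manager archive also includes amtool, which can sanity-check the configuration; the health endpoint gives a quick liveness check:

bash
# Validate the Alert Manager configuration
/opt/alertmanager/amtool check-config /opt/alertmanager/alertmanager.yml

# Confirm the service is responding (default port 9093)
curl -s http://localhost:9093/-/healthy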

Step 5: Configure Grafana Dashboards

After setting up Grafana, add Prometheus as a data source:

  1. Navigate to Configuration > Data Sources
  2. Click "Add data source"
  3. Select "Prometheus"
  4. Set the URL to http://localhost:9090 (or your Prometheus server address)
  5. Click "Save & Test"

Import pre-built dashboards for Ubuntu monitoring:

  1. Go to Create > Import
  2. Enter dashboard ID 1860 (Node Exporter Full)
  3. Select your Prometheus data source
  4. Click "Import"

Monitoring Ubuntu Cloud Resources with Cloud-Native Tools

When your Ubuntu instances are running in public cloud environments, leverage cloud-native monitoring tools:

AWS CloudWatch Integration

For Ubuntu instances on AWS, install the CloudWatch agent:

bash
# Install CloudWatch agent
wget https://s3.amazonaws.com/amazoncloudwatch-agent/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb

# Configure the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-config-wizard

Sample configuration file for /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json:

json
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "metrics_collected": {
      "cpu": {
        "measurement": [
          "cpu_usage_idle",
          "cpu_usage_user",
          "cpu_usage_system"
        ],
        "resources": [
          "*"
        ],
        "totalcpu": true
      },
      "disk": {
        "measurement": [
          "used_percent",
          "inodes_free"
        ],
        "resources": [
          "/"
        ]
      },
      "diskio": {
        "measurement": [
          "io_time",
          "write_bytes",
          "read_bytes",
          "writes",
          "reads"
        ],
        "resources": [
          "*"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ]
      },
      "netstat": {
        "measurement": [
          "tcp_established",
          "tcp_time_wait"
        ]
      },
      "swap": {
        "measurement": [
          "swap_used_percent"
        ]
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/syslog",
            "log_group_name": "syslog",
            "log_stream_name": "{instance_id}"
          },
          {
            "file_path": "/var/log/auth.log",
            "log_group_name": "auth.log",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}

Start the CloudWatch agent:

bash
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
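
To confirm the agent picked up the configuration and is running, the same control script exposes a status action:

bash
# Check the agent's status (should report "running" once started)
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -m ec2 -a status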

Azure Monitor Integration

For Ubuntu instances on Azure, install the Log Analytics agent:

bash
# Download and install the Log Analytics agent
wget https://raw.githubusercontent.com/Microsoft/OMS-Agent-for-Linux/master/installer/scripts/onboard_agent.sh
chmod +x onboard_agent.sh
sudo ./onboard_agent.sh -w <YOUR_WORKSPACE_ID> -s <YOUR_WORKSPACE_KEY>

Google Cloud Monitoring

For Ubuntu on Google Cloud, install the Ops Agent:

bash
# Add the Cloud Monitoring agent repository
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install

Best Practices for Ubuntu Cloud Monitoring

1. Define Clear Monitoring Objectives

Before implementing monitoring, define what you need to monitor:

  • Identify critical services and applications
  • Determine key performance indicators (KPIs)
  • Set thresholds for alerts based on historical data
  • Document monitoring requirements

2. Implement the USE Method

The USE method helps identify performance issues by focusing on:

  • Utilization: Percentage of time the resource is busy
  • Saturation: Amount of work a resource has queued
  • Errors: Count of error events

Apply this method to CPU, memory, disk, and network resources.
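
With Node Exporter in place, each USE dimension maps onto a PromQL query. The examples below, run through Prometheus's HTTP query API, use standard node_exporter metrics; adjust them to the resources you care about:

bash
# Utilization: CPU busy percentage per instance
curl -s http://localhost:9090/api/v1/query --data-urlencode \
  'query=100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'

# Saturation: 1-minute load average (work queued for the CPUs)
curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=node_load1'

# Errors: network receive errors per second
curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=rate(node_network_receive_errs_total[5m])'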

3. Implement the RED Method

For monitoring services, follow the RED method:

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Distribution of request latencies
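
Using the http_request_duration_seconds histogram defined in the Node.js example later in this guide, the RED signals can be queried like this (a sketch; the metric name and status_code label must match what your application actually exposes):

bash
# Rate: requests per second over the last 5 minutes
curl -s http://localhost:9090/api/v1/query --data-urlencode \
  'query=sum(rate(http_request_duration_seconds_count[5m]))'

# Errors: failed (5xx) requests per second
curl -s http://localhost:9090/api/v1/query --data-urlencode \
  'query=sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))'

# Duration: 95th percentile request latency
curl -s http://localhost:9090/api/v1/query --data-urlencode \
  'query=histogram_quantile(0.95, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))'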

4. Establish a Monitoring Hierarchy

Create a hierarchical monitoring structure, working from the bottom up:

  • Infrastructure: cloud provider resources (instances, storage, networking)
  • System: Ubuntu host metrics (CPU, memory, disk, network)
  • Services: web servers, databases, and other supporting services
  • Application: request rates, errors, and latencies for your own code
  • Business: user-facing metrics such as sign-ups or transactions

5. Implement Alert Fatigue Prevention

To prevent alert fatigue:

  • Define alert severity levels (critical, warning, info)
  • Implement alert throttling and grouping
  • Use progressive alerting (escalate alerts over time)
  • Regularly review and tune alert thresholds

6. Implement Log Rotation

Configure log rotation to prevent disk space issues:

bash
sudo tee /etc/logrotate.d/custom-logs > /dev/null << EOF
/var/log/application/*.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 0640 www-data www-data
    sharedscripts
    postrotate
        systemctl reload application.service > /dev/null 2>/dev/null || true
    endscript
}
EOF
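
logrotate can dry-run the new policy, so you can verify it before the scheduled run picks it up:

bash
# -d is debug/dry-run mode: shows what would happen without rotating anything
sudo logrotate -d /etc/logrotate.d/custom-logs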

7. Implement Centralized Logging

Use tools like Elasticsearch, Logstash, and Kibana (ELK stack) for centralized logging:

bash
# Install Filebeat for log shipping
curl -L -O https://artifacts.elastic.co/downloads/beats/filebeat/filebeat-8.5.3-amd64.deb
sudo dpkg -i filebeat-8.5.3-amd64.deb

# Configure Filebeat
sudo tee /etc/filebeat/filebeat.yml > /dev/null << EOF
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/*.log
      - /var/log/syslog
      - /var/log/auth.log

output.elasticsearch:
  hosts: ["elasticsearch-host:9200"]

setup.kibana:
  host: "kibana-host:5601"
EOF

# Start and enable Filebeat
sudo systemctl enable filebeat
sudo systemctl start filebeat
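
Filebeat ships with built-in self-tests that are worth running after editing the configuration:

bash
# Verify the configuration parses and the Elasticsearch output is reachable
sudo filebeat test config
sudo filebeat test output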

Practical Example: Complete Monitoring Setup for a Web Application

Let's implement a comprehensive monitoring solution for a web application running on Ubuntu:

Step 1: System and Service Monitoring

Install Node Exporter on all web servers and database servers following the instructions provided earlier.

Step 2: Application-Specific Monitoring

For a Node.js application, use the prom-client Prometheus client library:

javascript
// app.js - Node.js application with Prometheus monitoring
const express = require('express');
const client = require('prom-client');
const app = express();

// Create a Registry to register metrics
const register = new client.Registry();

// Enable default metrics collection
const collectDefaultMetrics = client.collectDefaultMetrics;
collectDefaultMetrics({ register });

// Create custom metrics
const httpRequestDurationMicroseconds = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});

// Register the custom metrics
register.registerMetric(httpRequestDurationMicroseconds);

// Middleware to measure request duration
app.use((req, res, next) => {
  const end = httpRequestDurationMicroseconds.startTimer();
  res.on('finish', () => {
    end({ method: req.method, route: req.route?.path || req.path, status_code: res.statusCode });
  });
  next();
});

// Endpoint for Prometheus to scrape metrics
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Application routes
app.get('/', (req, res) => {
  res.send('Hello World!');
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
});

Step 3: Database Monitoring

Install the PostgreSQL Exporter for database monitoring:

bash
# Download PostgreSQL Exporter
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.11.1/postgres_exporter-0.11.1.linux-amd64.tar.gz

# Extract the archive
tar xvf postgres_exporter-0.11.1.linux-amd64.tar.gz

# Move binary to /usr/local/bin
sudo mv postgres_exporter-0.11.1.linux-amd64/postgres_exporter /usr/local/bin/

# Create dedicated PostgreSQL user for monitoring
sudo -u postgres psql -c "CREATE USER postgres_exporter WITH PASSWORD 'password';"
sudo -u postgres psql -c "GRANT pg_monitor TO postgres_exporter;"

# Create systemd service
sudo tee /etc/systemd/system/postgres_exporter.service > /dev/null << EOF
[Unit]
Description=PostgreSQL Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=postgres
Group=postgres
Type=simple
Environment="DATA_SOURCE_NAME=postgresql://postgres_exporter:password@localhost:5432/postgres?sslmode=disable"
ExecStart=/usr/local/bin/postgres_exporter

[Install]
WantedBy=multi-user.target
EOF

# Start and enable service
sudo systemctl daemon-reload
sudo systemctl start postgres_exporter
sudo systemctl enable postgres_exporter
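
The exporter listens on port 9187 by default (matching the Prometheus target used in the next step); pg_up should report 1 once it can reach the database:

bash
# pg_up = 1 means the exporter can connect to PostgreSQL
curl -s http://localhost:9187/metrics | grep '^pg_up'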

Step 4: Update Prometheus Configuration

Update Prometheus configuration to scrape metrics from all components:

yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'web_servers'
    static_configs:
      - targets:
          - 'web-server-1:9100'   # Node Exporter
          - 'web-server-2:9100'

  - job_name: 'db_servers'
    static_configs:
      - targets:
          - 'db-server:9100'      # Node Exporter
          - 'db-server:9187'      # PostgreSQL Exporter

  - job_name: 'nodejs_app'
    static_configs:
      - targets:
          - 'web-server-1:3000'   # Node.js app metrics
          - 'web-server-2:3000'

Step 5: Create Grafana Dashboards

Create custom dashboards for different aspects of your application:

  1. System Overview Dashboard
  2. Web Application Performance Dashboard
  3. Database Performance Dashboard
  4. Business Metrics Dashboard

Troubleshooting Common Monitoring Issues

Problem: High CPU Usage on Monitored Nodes

Solution:

  1. Identify the process consuming CPU:
    bash
    sudo apt install htop
    htop --sort-key=PERCENT_CPU
  2. Check if it's related to the monitoring agent:
    bash
    ps aux | grep node_exporter
  3. Adjust scrape interval in Prometheus if needed.

Problem: Missing Data in Grafana

Solution:

  1. Check if Prometheus is scraping targets:
    bash
    curl http://localhost:9090/api/v1/targets
  2. Verify firewall rules:
    bash
    sudo ufw status
  3. Check for network connectivity issues:
    bash
    nc -zv target-host 9100

Problem: Disk Space Filling Up with Logs

Solution:

  1. Identify large log files:
    bash
    sudo du -h /var/log/ | sort -hr | head -10
  2. Configure log rotation as shown earlier
  3. Consider using logrotate's maxsize directive:
    maxsize 100M

Summary

Effective monitoring is essential for maintaining reliable and efficient Ubuntu cloud environments. By implementing the tools and practices covered in this guide, you can:

  • Gain visibility into system and application performance
  • Detect and resolve issues before they affect users
  • Optimize resource utilization and reduce costs
  • Ensure security and compliance

Remember that monitoring is not a one-time setup but an ongoing process that requires regular review and adjustment as your infrastructure evolves.

Exercises

  1. Set up a basic Prometheus and Node Exporter monitoring system on an Ubuntu VM.
  2. Create a custom Grafana dashboard showing system metrics.
  3. Configure alerts for high CPU, memory, and disk usage.
  4. Implement application-level monitoring for a web application of your choice.
  5. Set up centralized logging with the ELK stack.

