Cloud Service Monitoring
Introduction
Modern applications are increasingly deployed in cloud environments, making effective cloud service monitoring essential for maintaining reliability and performance. This guide explores how Prometheus can be used to monitor various cloud services across different providers, helping you gain visibility into your cloud infrastructure and applications.
Cloud service monitoring with Prometheus allows you to:
- Track the performance and health of cloud resources
- Create alerts for potential issues before they impact users
- Understand usage patterns to optimize costs
- Maintain compliance with service level agreements (SLAs)
- Make data-driven decisions about scaling and resource allocation
Understanding Cloud Service Monitoring with Prometheus
The Challenge of Cloud Monitoring
Cloud environments present unique monitoring challenges:
- Dynamic Infrastructure: Resources are created and destroyed frequently
- Distributed Components: Services span multiple regions and availability zones
- Multiple Abstraction Layers: From virtual machines to managed services
- Vendor-Specific Metrics: Each cloud provider has its own metrics and monitoring approaches
Prometheus helps address these challenges with its pull-based architecture, service discovery mechanisms, and flexible data model.
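The pull model is simple to reason about: every target exposes its current metric values over HTTP in the Prometheus text exposition format, and Prometheus fetches them on its own schedule. A minimal sketch in Python (the metric names, labels, and port here are illustrative, not a real exporter):

```python
# Minimal sketch of the pull model: an HTTP endpoint renders metrics in
# the Prometheus text exposition format, and Prometheus scrapes it.
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics(samples):
    """Render (name, labels, value) tuples as Prometheus text format."""
    lines = []
    for name, labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Illustrative sample: a single gauge with cloud metadata labels
        body = render_metrics([
            ("cloud_instance_up", {"provider": "aws", "zone": "us-west-1a"}, 1),
        ]).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

print(render_metrics([("cloud_instance_up", {"provider": "aws", "zone": "us-west-1a"}, 1)]))
# To serve for real: HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

In production you would use an official client library instead, but this shows why dynamic cloud targets fit the pull model: whatever service discovery finds, Prometheus simply fetches.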
Cloud Provider Integrations
AWS Monitoring with Prometheus
Amazon Web Services can be monitored with Prometheus using several approaches:
Option 1: CloudWatch Exporter
The CloudWatch exporter lets Prometheus scrape metrics that AWS services publish to CloudWatch.
# Run the CloudWatch exporter (it needs a config file listing the
# CloudWatch metrics to export, mounted at /config/config.yml)
docker run -d -p 9106:9106 \
-e AWS_ACCESS_KEY_ID=your-access-key \
-e AWS_SECRET_ACCESS_KEY=your-secret-key \
-e AWS_REGION=us-west-1 \
-v /path/to/cloudwatch-config.yml:/config/config.yml \
quay.io/prometheus/cloudwatch-exporter:latest
Configure Prometheus to scrape the exporter:
scrape_configs:
  - job_name: 'aws-cloudwatch'
    static_configs:
      - targets: ['localhost:9106']
Option 2: EC2 Service Discovery
For EC2 instances, you can use Prometheus' built-in EC2 service discovery:
scrape_configs:
  - job_name: 'ec2-instances'
    ec2_sd_configs:
      - region: us-west-1
        access_key: your-access-key
        secret_key: your-secret-key
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
GCP Monitoring with Prometheus
Google Cloud Platform offers several integration options:
Using GCE Service Discovery
scrape_configs:
  - job_name: 'gce-instances'
    gce_sd_configs:
      - project: your-gcp-project
        zone: us-central1-a
        port: 9100
    relabel_configs:
      - source_labels: [__meta_gce_name]
        target_label: instance
Using Stackdriver Exporter
# Run the Stackdriver exporter (the Stackdriver sidecar image is for
# remote write; use the community exporter for scrapeable metrics)
docker run -d -p 9255:9255 \
-e GOOGLE_APPLICATION_CREDENTIALS=/gcp-sa.json \
-e STACKDRIVER_EXPORTER_GOOGLE_PROJECT_ID=your-gcp-project \
-v /path/to/gcp-sa.json:/gcp-sa.json \
prometheuscommunity/stackdriver-exporter:latest
Azure Monitoring with Prometheus
Microsoft Azure can be monitored with:
Azure Exporter
# Install Azure exporter
docker run -d -p 9276:9276 \
-e AZURE_SUBSCRIPTION_ID=your-subscription-id \
-e AZURE_CLIENT_ID=your-client-id \
-e AZURE_CLIENT_SECRET=your-client-secret \
-e AZURE_TENANT_ID=your-tenant-id \
webdevops/azure-metrics-exporter
Setting Up Comprehensive Cloud Monitoring
A complete cloud monitoring solution typically involves:
- Infrastructure Metrics: CPU, memory, disk and network usage
- Service Metrics: Specific to managed services like databases, message queues, etc.
- Cost Metrics: Usage that translates to billing
- Availability Metrics: Uptime and regional availability
Let's create a monitoring setup that covers these aspects:
Step 1: Deploy Node Exporters
For virtual machine metrics, deploy Node Exporter on each instance:
# Install Node Exporter on Linux instances
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.3.1.linux-amd64.tar.gz
cd node_exporter-1.3.1.linux-amd64
./node_exporter &
Step 2: Configure Service Discovery
Use cloud-specific service discovery to automatically find and monitor instances:
scrape_configs:
  - job_name: 'aws-ec2'
    ec2_sd_configs:
      - region: us-west-1
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Environment]
        regex: production
        action: keep
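The relabel_configs shown above do two different things: replace-style rules copy discovered metadata into target labels, and keep rules drop targets whose source labels do not match the regex. The following is a simplified Python sketch of those semantics for intuition only; Prometheus's real implementation also supports capture-group replacements, hashing, and other actions:

```python
# Simplified sketch of Prometheus relabeling semantics.
# "keep" drops non-matching targets; "replace" copies the joined source
# label values into target_label. Label names are illustrative EC2 metadata.
import re

def apply_relabel(labels, rules):
    """Return the relabeled label set, or None if a keep rule rejects it."""
    labels = dict(labels)
    for rule in rules:
        value = ";".join(labels.get(s, "") for s in rule.get("source_labels", []))
        pattern = rule.get("regex", "(.*)")
        action = rule.get("action", "replace")
        if action == "keep":
            if not re.fullmatch(pattern, value):
                return None  # target is dropped entirely
        elif action == "replace":
            if re.fullmatch(pattern, value):
                labels[rule["target_label"]] = value
    return labels

prod = apply_relabel(
    {"__meta_ec2_tag_Environment": "production", "__meta_ec2_tag_Name": "web-1"},
    [{"source_labels": ["__meta_ec2_tag_Environment"], "regex": "production", "action": "keep"},
     {"source_labels": ["__meta_ec2_tag_Name"], "target_label": "instance"}],
)
print(prod)  # carries instance="web-1"; a staging target would yield None
```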
Step 3: Monitor Managed Services
Use specialized exporters for managed services:
scrape_configs:
  - job_name: 'aws-rds'
    static_configs:
      - targets: ['localhost:9042']
    metrics_path: '/metrics/rds'
Step 4: Set Up Alerting
Create alerts for cloud-specific issues:
groups:
  - name: cloud-alerts
    rules:
      - alert: HighCloudSpend
        expr: sum(aws_billing_estimated_charges) > 1000
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: "High AWS spend detected"
          description: "AWS charges have exceeded the $1000 threshold"
      - alert: InstanceOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} out of memory"
          description: "Instance has less than 10% available memory"
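The InstanceOutOfMemory expression is ordinary arithmetic over two gauges. Evaluated outside Prometheus with made-up byte values, the condition looks like this:

```python
# The same condition as the PromQL expression
# node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10,
# with illustrative byte values.
def available_memory_percent(mem_available_bytes, mem_total_bytes):
    """Percentage of memory still available on the instance."""
    return mem_available_bytes / mem_total_bytes * 100

def out_of_memory(mem_available_bytes, mem_total_bytes, threshold_percent=10):
    """True when available memory drops below the alert threshold."""
    return available_memory_percent(mem_available_bytes, mem_total_bytes) < threshold_percent

print(out_of_memory(800_000_000, 16_000_000_000))    # 5% available -> True
print(out_of_memory(8_000_000_000, 16_000_000_000))  # 50% available -> False
```

The `for: 5m` clause means Prometheus only fires the alert after the condition has held continuously for five minutes, which filters out brief spikes.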
Cloud Monitoring Best Practices
For effective cloud monitoring with Prometheus:
- Use Labels Effectively: Include cloud-specific metadata as labels
relabel_configs:
  - source_labels: [__meta_ec2_availability_zone]
    target_label: zone
  - source_labels: [__meta_ec2_instance_type]
    target_label: instance_type
- Implement Hierarchical Service Discovery: Discover different tiers of services
scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
- Consider Using Remote Write: For long-term storage of cloud metrics
remote_write:
  - url: "https://prometheus.example.org/api/v1/write"
    basic_auth:
      username: "username"
      password: "password"
- Implement Federation: For multi-region or multi-cloud setups
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'prometheus.us-region.example.org:9090'
          - 'prometheus.eu-region.example.org:9090'
Visualizing Cloud Metrics
Create effective Grafana dashboards for cloud metrics:
Example PromQL queries for cloud dashboards:
- EC2 CPU Utilization by Instance Type:
avg by (instance_type) (rate(node_cpu_seconds_total{mode!="idle"}[5m]) * 100)
- S3 Request Latency:
aws_s3_request_latency_seconds_sum / aws_s3_request_latency_seconds_count
- RDS Free Storage Space:
aws_rds_free_storage_space_average
- Cross-Cloud Comparison:
sum by (provider) (rate(http_requests_total[5m]))
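The S3 latency query divides a `_sum` counter by its `_count` partner, which gives an average; over a time window you divide the increase of each instead. The same calculation on illustrative snapshot values:

```python
# A _sum/_count pair (e.g. aws_s3_request_latency_seconds_sum and
# aws_s3_request_latency_seconds_count) yields an average latency when
# divided. Over a window, divide the increases. Values are illustrative.
def average_latency(sum_start, sum_end, count_start, count_end):
    """Average latency over a window from two snapshots of a sum/count pair."""
    delta_sum = sum_end - sum_start
    delta_count = count_end - count_start
    if delta_count == 0:
        return float("nan")  # no requests in the window
    return delta_sum / delta_count

# 50 requests took 2.5 s in total during the window -> 50 ms average
print(average_latency(100.0, 102.5, 1000, 1050))
```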
Practical Example: Multi-Cloud Monitoring
Let's implement a complete example for monitoring applications across AWS and GCP:
1. Set up exporters for each cloud
# AWS CloudWatch exporter (mount a config file listing the metrics to export)
docker run -d --name aws-exporter \
-p 9106:9106 \
-e AWS_ACCESS_KEY_ID=your-access-key \
-e AWS_SECRET_ACCESS_KEY=your-secret-key \
-e AWS_REGION=us-west-1 \
-v /path/to/cloudwatch-config.yml:/config/config.yml \
quay.io/prometheus/cloudwatch-exporter:latest
# GCP Stackdriver exporter
docker run -d --name gcp-exporter \
-p 9255:9255 \
-e GOOGLE_APPLICATION_CREDENTIALS=/gcp-sa.json \
-e STACKDRIVER_EXPORTER_GOOGLE_PROJECT_ID=your-gcp-project \
-v /path/to/gcp-sa.json:/gcp-sa.json \
prometheuscommunity/stackdriver-exporter:latest
2. Configure Prometheus to scrape both clouds
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'aws-cloudwatch'
    static_configs:
      - targets: ['localhost:9106']
    metrics_path: '/metrics'
  - job_name: 'gcp-stackdriver'
    static_configs:
      - targets: ['localhost:9255']
    metrics_path: '/metrics'
  - job_name: 'aws-ec2'
    ec2_sd_configs:
      - region: us-west-1
        access_key: your-access-key
        secret_key: your-secret-key
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
      - source_labels: [__meta_ec2_instance_type]
        target_label: instance_type
      - target_label: provider
        replacement: aws
  - job_name: 'gcp-instances'
    gce_sd_configs:
      - project: your-gcp-project
        zone: us-central1-a
        port: 9100
    relabel_configs:
      - source_labels: [__meta_gce_name]
        target_label: instance
      - source_labels: [__meta_gce_machine_type]
        target_label: instance_type
      - target_label: provider
        replacement: gcp
3. Create multi-cloud recording rules
groups:
  - name: cloud-resources
    rules:
      - record: job:node_cpu_utilization:avg
        expr: avg by (provider, job) (rate(node_cpu_seconds_total{mode!="idle"}[5m]) * 100)
      - record: job:node_memory_utilization:avg
        expr: avg by (provider, job) (100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100))
      - record: provider:request_errors:rate5m
        expr: sum by (provider) (rate(http_requests_total{status=~"5.."}[5m]))
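These recording rules aggregate many per-instance series into one series per provider. The grouping step that `avg by (provider, ...)` performs, reproduced on illustrative samples in Python:

```python
# Sketch of what an avg-by-label aggregation does: group sample values by
# one label (here "provider") and average each group. Values are made up.
from collections import defaultdict

def avg_by(samples, key):
    """Average (labels, value) samples grouped by the given label name."""
    groups = defaultdict(list)
    for labels, value in samples:
        groups[labels[key]].append(value)
    return {k: sum(v) / len(v) for k, v in groups.items()}

samples = [
    ({"provider": "aws", "instance": "i-1"}, 40.0),
    ({"provider": "aws", "instance": "i-2"}, 60.0),
    ({"provider": "gcp", "instance": "vm-1"}, 30.0),
]
print(avg_by(samples, "provider"))  # {'aws': 50.0, 'gcp': 30.0}
```

Because the rules keep the `provider` label, a single Grafana panel can chart all clouds side by side from the precomputed series.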
4. Set up cross-cloud alerts
groups:
  - name: cloud-alerts
    rules:
      - alert: HighErrorRate
        expr: sum by (provider) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (provider) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.provider }}"
          description: "Error rate is above 5% on {{ $labels.provider }}"
      - alert: CloudCostSpike
        expr: sum by (provider) (rate(node_network_transmit_bytes_total[1d])) > 10 * avg_over_time(sum by (provider) (rate(node_network_transmit_bytes_total[1d]))[7d:1d])
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: "Unusual egress traffic on {{ $labels.provider }}"
          description: "Network egress is 10x higher than the weekly average, which may have cost implications"
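The CloudCostSpike expression compares the current egress rate against 10x the average of the previous week's daily rates. The same comparison in plain Python with illustrative byte rates:

```python
# Sketch of the CloudCostSpike comparison: current egress rate versus
# 10x the average of the past week's daily rates. Values are illustrative.
def egress_spike(current_rate, daily_rates, factor=10):
    """True when the current rate exceeds factor x the week's average."""
    weekly_avg = sum(daily_rates) / len(daily_rates)
    return current_rate > factor * weekly_avg

week = [1.0e6, 1.2e6, 0.9e6, 1.1e6, 1.0e6, 0.8e6, 1.0e6]  # bytes/s per day
print(egress_spike(2.0e6, week))   # ~2x the average -> False
print(egress_spike(15.0e6, week))  # ~15x the average -> True
```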
Common Pitfalls and Solutions
When monitoring cloud services with Prometheus, watch out for:
- Credential Management: Rotate cloud credentials regularly and use IAM roles when possible
- Rate Limiting: Cloud APIs often have rate limits
scrape_configs:
  - job_name: 'aws-cloudwatch'
    scrape_interval: 5m  # Longer interval to avoid rate limits
- Cost Management: Some cloud provider APIs charge per API call
scrape_configs:
  - job_name: 'expensive-metrics'
    scrape_interval: 30m  # Longer interval to reduce costs
- Missing Instances: Ensure your service discovery configuration is comprehensive
ec2_sd_configs:
  - region: us-west-1
  - region: us-east-1
  - region: eu-west-1
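To see why longer scrape intervals matter when APIs charge per call, here is rough back-of-envelope arithmetic. This is illustrative only; real CloudWatch pricing, batching, and request shapes differ:

```python
# Rough intuition for per-call API pricing: monthly request volume scales
# inversely with scrape interval. Numbers are illustrative, not real pricing.
def monthly_api_calls(metrics, scrape_interval_seconds, batch_size=1):
    """Approximate API calls per 30-day month for a given scrape interval."""
    scrapes_per_month = 30 * 24 * 3600 / scrape_interval_seconds
    calls_per_scrape = -(-metrics // batch_size)  # ceiling division
    return int(scrapes_per_month * calls_per_scrape)

# 100 metrics scraped every 60 s versus every 30 minutes:
print(monthly_api_calls(100, 60))    # 4,320,000 calls/month
print(monthly_api_calls(100, 1800))  # 144,000 calls/month
```

Going from a 1-minute to a 30-minute interval cuts the call volume by 30x, which is why slow-moving billing and capacity metrics rarely justify tight scrape intervals.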
Summary
Cloud service monitoring with Prometheus provides a unified approach to observing your infrastructure and applications across multiple cloud providers. By leveraging Prometheus' service discovery capabilities, exporters, and powerful query language, you can gain comprehensive visibility into your cloud resources.
Key takeaways:
- Use cloud-specific exporters to collect metrics from managed services
- Configure service discovery to automatically monitor dynamic cloud resources
- Apply consistent labeling to enable cross-cloud comparisons
- Implement appropriate recording rules and alerts for cloud-specific concerns
- Consider federation and remote storage for large-scale deployments
Further Learning
To deepen your knowledge about cloud service monitoring with Prometheus:
- Explore cloud-specific exporters in the Prometheus community
- Learn about optimization techniques for high-cardinality cloud metrics
- Experiment with multi-region and multi-cloud federation
- Practice building comprehensive cloud dashboards in Grafana
Exercises
- Set up Prometheus to monitor AWS EC2 instances using service discovery
- Create a Grafana dashboard showing comparative metrics between two cloud providers
- Implement alerts for unusual cloud spending patterns
- Configure a federated setup for multiple regions in the same cloud provider