Cloud Service Monitoring
Introduction
Modern applications are increasingly deployed in cloud environments, making effective cloud service monitoring essential for maintaining reliability and performance. This guide explores how Prometheus can be used to monitor various cloud services across different providers, helping you gain visibility into your cloud infrastructure and applications.
Cloud service monitoring with Prometheus allows you to:
- Track the performance and health of cloud resources
- Create alerts for potential issues before they impact users
- Understand usage patterns to optimize costs
- Maintain compliance with service level agreements (SLAs)
- Make data-driven decisions about scaling and resource allocation
Understanding Cloud Service Monitoring with Prometheus
The Challenge of Cloud Monitoring
Cloud environments present unique monitoring challenges:
- Dynamic Infrastructure: Resources are created and destroyed frequently
- Distributed Components: Services span multiple regions and availability zones
- Multiple Abstraction Layers: From virtual machines to managed services
- Vendor-Specific Metrics: Each cloud provider has its own metrics and monitoring approaches
Prometheus helps address these challenges with its pull-based architecture, service discovery mechanisms, and flexible data model.
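The pull model is simple to reason about: every target exposes its current metric values over HTTP in the Prometheus text exposition format, and Prometheus fetches them on its own schedule. A minimal sketch in Python (the metric names, labels, and port here are illustrative, not a real exporter):

```python
# Minimal sketch of the pull model: an HTTP endpoint renders metrics in
# the Prometheus text exposition format, and Prometheus scrapes it.
from http.server import BaseHTTPRequestHandler, HTTPServer

def render_metrics(samples):
    """Render (name, labels, value) tuples as Prometheus text format."""
    lines = []
    for name, labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Illustrative sample: a single gauge with cloud metadata labels
        body = render_metrics([
            ("cloud_instance_up", {"provider": "aws", "zone": "us-west-1a"}, 1),
        ]).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

print(render_metrics([("cloud_instance_up", {"provider": "aws", "zone": "us-west-1a"}, 1)]))
# To serve for real: HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

In production you would use an official client library instead, but this shows why dynamic cloud targets fit the pull model: whatever service discovery finds, Prometheus simply fetches.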
Cloud Provider Integrations
AWS Monitoring with Prometheus
Amazon Web Services can be monitored with Prometheus using several approaches:
Option 1: CloudWatch Exporter
The CloudWatch exporter lets Prometheus scrape metrics that AWS services publish to CloudWatch.
# Run the CloudWatch exporter (it needs a config file listing the
# CloudWatch metrics to export, mounted at /config/config.yml)
docker run -d -p 9106:9106 \
-e AWS_ACCESS_KEY_ID=your-access-key \
-e AWS_SECRET_ACCESS_KEY=your-secret-key \
-e AWS_REGION=us-west-1 \
-v /path/to/cloudwatch-config.yml:/config/config.yml \
quay.io/prometheus/cloudwatch-exporter:latest
Configure Prometheus to scrape the exporter:
scrape_configs:
  - job_name: 'aws-cloudwatch'
    static_configs:
      - targets: ['localhost:9106']
Option 2: EC2 Service Discovery
For EC2 instances, you can use Prometheus' built-in EC2 service discovery:
scrape_configs:
  - job_name: 'ec2-instances'
    ec2_sd_configs:
      - region: us-west-1
        access_key: your-access-key
        secret_key: your-secret-key
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
GCP Monitoring with Prometheus
Google Cloud Platform offers several integration options:
Using GCE Service Discovery
scrape_configs:
  - job_name: 'gce-instances'
    gce_sd_configs:
      - project: your-gcp-project
        zone: us-central1-a
        port: 9100
    relabel_configs:
      - source_labels: [__meta_gce_name]
        target_label: instance
Using Stackdriver Exporter
# Run the Stackdriver exporter (the Stackdriver sidecar image is for
# remote write; use the community exporter for scrapeable metrics)
docker run -d -p 9255:9255 \
-e GOOGLE_APPLICATION_CREDENTIALS=/gcp-sa.json \
-e STACKDRIVER_EXPORTER_GOOGLE_PROJECT_ID=your-gcp-project \
-v /path/to/gcp-sa.json:/gcp-sa.json \
prometheuscommunity/stackdriver-exporter:latest
Azure Monitoring with Prometheus
Microsoft Azure can be monitored with:
Azure Exporter
# Install Azure exporter
docker run -d -p 9276:9276 \
-e AZURE_SUBSCRIPTION_ID=your-subscription-id \
-e AZURE_CLIENT_ID=your-client-id \
-e AZURE_CLIENT_SECRET=your-client-secret \
-e AZURE_TENANT_ID=your-tenant-id \
webdevops/azure-metrics-exporter
Setting Up Comprehensive Cloud Monitoring
A complete cloud monitoring solution typically involves:
- Infrastructure Metrics: CPU, memory, disk and network usage
- Service Metrics: Specific to managed services like databases, message queues, etc.
- Cost Metrics: Usage that translates to billing
- Availability Metrics: Uptime and regional availability
Let's create a monitoring setup that covers these aspects:
Step 1: Deploy Node Exporters
For virtual machine metrics, deploy Node Exporter on each instance:
# Install Node Exporter on Linux instances
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.3.1.linux-amd64.tar.gz
cd node_exporter-1.3.1.linux-amd64
./node_exporter &
Step 2: Configure Service Discovery
Use cloud-specific service discovery to automatically find and monitor instances:
scrape_configs:
  - job_name: 'aws-ec2'
    ec2_sd_configs:
      - region: us-west-1
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Environment]
        regex: production
        action: keep
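The relabel_configs shown above do two different things: replace-style rules copy discovered metadata into target labels, and keep rules drop targets whose source labels do not match the regex. The following is a simplified Python sketch of those semantics for intuition only; Prometheus's real implementation also supports capture-group replacements, hashing, and other actions:

```python
# Simplified sketch of Prometheus relabeling semantics.
# "keep" drops non-matching targets; "replace" copies the joined source
# label values into target_label. Label names are illustrative EC2 metadata.
import re

def apply_relabel(labels, rules):
    """Return the relabeled label set, or None if a keep rule rejects it."""
    labels = dict(labels)
    for rule in rules:
        value = ";".join(labels.get(s, "") for s in rule.get("source_labels", []))
        pattern = rule.get("regex", "(.*)")
        action = rule.get("action", "replace")
        if action == "keep":
            if not re.fullmatch(pattern, value):
                return None  # target is dropped entirely
        elif action == "replace":
            if re.fullmatch(pattern, value):
                labels[rule["target_label"]] = value
    return labels

prod = apply_relabel(
    {"__meta_ec2_tag_Environment": "production", "__meta_ec2_tag_Name": "web-1"},
    [{"source_labels": ["__meta_ec2_tag_Environment"], "regex": "production", "action": "keep"},
     {"source_labels": ["__meta_ec2_tag_Name"], "target_label": "instance"}],
)
print(prod)  # carries instance="web-1"; a staging target would yield None
```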
Step 3: Monitor Managed Services
Use specialized exporters for managed services:
scrape_configs:
  - job_name: 'aws-rds'
    static_configs:
      - targets: ['localhost:9042']
    metrics_path: '/metrics/rds'
Step 4: Set Up Alerting
Create alerts for cloud-specific issues:
groups:
  - name: cloud-alerts
    rules:
      - alert: HighCloudSpend
        expr: sum(aws_billing_estimated_charges) > 1000
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: "High AWS spend detected"
          description: "AWS charges have exceeded the $1000 threshold"
      - alert: InstanceOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} out of memory"
          description: "Instance has less than 10% available memory"
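The InstanceOutOfMemory expression is ordinary arithmetic over two gauges. Evaluated outside Prometheus with made-up byte values, the condition looks like this:

```python
# The same condition as the PromQL expression
# node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10,
# with illustrative byte values.
def available_memory_percent(mem_available_bytes, mem_total_bytes):
    """Percentage of memory still available on the instance."""
    return mem_available_bytes / mem_total_bytes * 100

def out_of_memory(mem_available_bytes, mem_total_bytes, threshold_percent=10):
    """True when available memory drops below the alert threshold."""
    return available_memory_percent(mem_available_bytes, mem_total_bytes) < threshold_percent

print(out_of_memory(800_000_000, 16_000_000_000))    # 5% available -> True
print(out_of_memory(8_000_000_000, 16_000_000_000))  # 50% available -> False
```

The `for: 5m` clause means Prometheus only fires the alert after the condition has held continuously for five minutes, which filters out brief spikes.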
Cloud Monitoring Best Practices
For effective cloud monitoring with Prometheus:
- Use Labels Effectively: Include cloud-specific metadata as labels
relabel_configs:
  - source_labels: [__meta_ec2_availability_zone]
    target_label: zone
  - source_labels: [__meta_ec2_instance_type]
    target_label: instance_type
- Implement Hierarchical Service Discovery: Discover different tiers of services
scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
- Consider Using Remote Write: For long-term storage of cloud metrics
remote_write:
  - url: "https://prometheus.example.org/api/v1/write"
    basic_auth:
      username: "username"
      password: "password"
- Implement Federation: For multi-region or multi-cloud setups
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'prometheus.us-region.example.org:9090'
          - 'prometheus.eu-region.example.org:9090'
Visualizing Cloud Metrics
Create effective Grafana dashboards for cloud metrics:
Example PromQL queries for cloud dashboards:
- EC2 CPU Utilization by Instance Type:
avg by (instance_type) (rate(node_cpu_seconds_total{mode!="idle"}[5m]) * 100)
- S3 Request Latency:
aws_s3_request_latency_seconds_sum / aws_s3_request_latency_seconds_count
- RDS Free Storage Space:
aws_rds_free_storage_space_average
- Cross-Cloud Comparison:
sum by (provider) (rate(http_requests_total[5m]))
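The S3 latency query divides a `_sum` counter by its `_count` partner, which gives an average; over a time window you divide the increase of each instead. The same calculation on illustrative snapshot values:

```python
# A _sum/_count pair (e.g. aws_s3_request_latency_seconds_sum and
# aws_s3_request_latency_seconds_count) yields an average latency when
# divided. Over a window, divide the increases. Values are illustrative.
def average_latency(sum_start, sum_end, count_start, count_end):
    """Average latency over a window from two snapshots of a sum/count pair."""
    delta_sum = sum_end - sum_start
    delta_count = count_end - count_start
    if delta_count == 0:
        return float("nan")  # no requests in the window
    return delta_sum / delta_count

# 50 requests took 2.5 s in total during the window -> 50 ms average
print(average_latency(100.0, 102.5, 1000, 1050))
```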
Practical Example: Multi-Cloud Monitoring
Let's implement a complete example for monitoring applications across AWS and GCP:
1. Set up exporters for each cloud
# AWS CloudWatch exporter (mount a config file listing the metrics to export)
docker run -d --name aws-exporter \
-p 9106:9106 \
-e AWS_ACCESS_KEY_ID=your-access-key \
-e AWS_SECRET_ACCESS_KEY=your-secret-key \
-e AWS_REGION=us-west-1 \
-v /path/to/cloudwatch-config.yml:/config/config.yml \
quay.io/prometheus/cloudwatch-exporter:latest
# GCP Stackdriver exporter
docker run -d --name gcp-exporter \
-p 9255:9255 \
-e GOOGLE_APPLICATION_CREDENTIALS=/gcp-sa.json \
-e STACKDRIVER_EXPORTER_GOOGLE_PROJECT_ID=your-gcp-project \
-v /path/to/gcp-sa.json:/gcp-sa.json \
prometheuscommunity/stackdriver-exporter:latest
2. Configure Prometheus to scrape both clouds
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'aws-cloudwatch'
    static_configs:
      - targets: ['localhost:9106']
    metrics_path: '/metrics'
  - job_name: 'gcp-stackdriver'
    static_configs:
      - targets: ['localhost:9255']
    metrics_path: '/metrics'
  - job_name: 'aws-ec2'
    ec2_sd_configs:
      - region: us-west-1
        access_key: your-access-key
        secret_key: your-secret-key
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
      - source_labels: [__meta_ec2_instance_type]
        target_label: instance_type
      - target_label: provider
        replacement: aws
  - job_name: 'gcp-instances'
    gce_sd_configs:
      - project: your-gcp-project
        zone: us-central1-a
        port: 9100
    relabel_configs:
      - source_labels: [__meta_gce_name]
        target_label: instance
      - source_labels: [__meta_gce_machine_type]
        target_label: instance_type
      - target_label: provider
        replacement: gcp
3. Create multi-cloud recording rules
groups:
  - name: cloud-resources
    rules:
      - record: job:node_cpu_utilization:avg
        expr: avg by (provider, job) (rate(node_cpu_seconds_total{mode!="idle"}[5m]) * 100)
      - record: job:node_memory_utilization:avg
        expr: avg by (provider, job) (100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100))
      - record: provider:request_errors:rate5m
        expr: sum by (provider) (rate(http_requests_total{status=~"5.."}[5m]))
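These recording rules aggregate many per-instance series into one series per provider. The grouping step that `avg by (provider, ...)` performs, reproduced on illustrative samples in Python:

```python
# Sketch of what an avg-by-label aggregation does: group sample values by
# one label (here "provider") and average each group. Values are made up.
from collections import defaultdict

def avg_by(samples, key):
    """Average (labels, value) samples grouped by the given label name."""
    groups = defaultdict(list)
    for labels, value in samples:
        groups[labels[key]].append(value)
    return {k: sum(v) / len(v) for k, v in groups.items()}

samples = [
    ({"provider": "aws", "instance": "i-1"}, 40.0),
    ({"provider": "aws", "instance": "i-2"}, 60.0),
    ({"provider": "gcp", "instance": "vm-1"}, 30.0),
]
print(avg_by(samples, "provider"))  # {'aws': 50.0, 'gcp': 30.0}
```

Because the rules keep the `provider` label, a single Grafana panel can chart all clouds side by side from the precomputed series.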
4. Set up cross-cloud alerts
groups:
  - name: cloud-alerts
    rules:
      - alert: HighErrorRate
        expr: sum by (provider) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (provider) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.provider }}"
          description: "Error rate is above 5% on {{ $labels.provider }}"
      - alert: CloudCostSpike
        expr: sum by (provider) (rate(node_network_transmit_bytes_total[1d])) > 10 * avg_over_time(sum by (provider) (rate(node_network_transmit_bytes_total[1d]))[7d:1d])
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: "Unusual egress traffic on {{ $labels.provider }}"
          description: "Network egress is 10x higher than the weekly average, which may have cost implications"
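The CloudCostSpike expression compares the current egress rate against 10x the average of the previous week's daily rates. The same comparison in plain Python with illustrative byte rates:

```python
# Sketch of the CloudCostSpike comparison: current egress rate versus
# 10x the average of the past week's daily rates. Values are illustrative.
def egress_spike(current_rate, daily_rates, factor=10):
    """True when the current rate exceeds factor x the week's average."""
    weekly_avg = sum(daily_rates) / len(daily_rates)
    return current_rate > factor * weekly_avg

week = [1.0e6, 1.2e6, 0.9e6, 1.1e6, 1.0e6, 0.8e6, 1.0e6]  # bytes/s per day
print(egress_spike(2.0e6, week))   # ~2x the average -> False
print(egress_spike(15.0e6, week))  # ~15x the average -> True
```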
Common Pitfalls and Solutions
When monitoring cloud services with Prometheus, watch out for:
- Credential Management: Rotate cloud credentials regularly and use IAM roles when possible
- Rate Limiting: Cloud APIs often have rate limits
scrape_configs:
  - job_name: 'aws-cloudwatch'
    scrape_interval: 5m  # Longer interval to avoid rate limits
- Cost Management: Some cloud provider APIs charge per API call
scrape_configs:
  - job_name: 'expensive-metrics'
    scrape_interval: 30m  # Longer interval to reduce costs
- Missing Instances: Ensure your service discovery configuration is comprehensive
ec2_sd_configs:
  - region: us-west-1
  - region: us-east-1
  - region: eu-west-1
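To see why longer scrape intervals matter when APIs charge per call, here is rough back-of-envelope arithmetic. This is illustrative only; real CloudWatch pricing, batching, and request shapes differ:

```python
# Rough intuition for per-call API pricing: monthly request volume scales
# inversely with scrape interval. Numbers are illustrative, not real pricing.
def monthly_api_calls(metrics, scrape_interval_seconds, batch_size=1):
    """Approximate API calls per 30-day month for a given scrape interval."""
    scrapes_per_month = 30 * 24 * 3600 / scrape_interval_seconds
    calls_per_scrape = -(-metrics // batch_size)  # ceiling division
    return int(scrapes_per_month * calls_per_scrape)

# 100 metrics scraped every 60 s versus every 30 minutes:
print(monthly_api_calls(100, 60))    # 4,320,000 calls/month
print(monthly_api_calls(100, 1800))  # 144,000 calls/month
```

Going from a 1-minute to a 30-minute interval cuts the call volume by 30x, which is why slow-moving billing and capacity metrics rarely justify tight scrape intervals.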
Summary
Cloud service monitoring with Prometheus provides a unified approach to observing your infrastructure and applications across multiple cloud providers. By leveraging Prometheus' service discovery capabilities, exporters, and powerful query language, you can gain comprehensive visibility into your cloud resources.
Key takeaways:
- Use cloud-specific exporters to collect metrics from managed services
- Configure service discovery to automatically monitor dynamic cloud resources
- Apply consistent labeling to enable cross-cloud comparisons
- Implement appropriate recording rules and alerts for cloud-specific concerns
- Consider federation and remote storage for large-scale deployments
Further Learning
To deepen your knowledge about cloud service monitoring with Prometheus:
- Explore cloud-specific exporters in the Prometheus community
- Learn about optimization techniques for high-cardinality cloud metrics
- Experiment with multi-region and multi-cloud federation
- Practice building comprehensive cloud dashboards in Grafana
Exercises
- Set up Prometheus to monitor AWS EC2 instances using service discovery
- Create a Grafana dashboard showing comparative metrics between two cloud providers
- Implement alerts for unusual cloud spending patterns
- Configure a federated setup for multiple regions in the same cloud provider