GCE Service Discovery

Introduction

When monitoring applications deployed on Google Cloud Platform (GCP), manually maintaining a list of targets to scrape can become unwieldy as your infrastructure scales. Prometheus offers a powerful solution with its Google Compute Engine (GCE) service discovery mechanism, which automatically discovers and monitors your GCE instances.

This service discovery integration allows Prometheus to:

Automatically discover GCE instances in your projects
Apply filtering based on metadata and other attributes
Dynamically update monitoring targets as instances are created or terminated
Extract labels from GCE metadata for better organization of metrics

In this guide, we'll explore how to configure and use GCE service discovery with Prometheus, providing practical examples to help you integrate this capability into your monitoring infrastructure.

Prerequisites

Before setting up GCE service discovery, you'll need:

A Google Cloud Platform (GCP) account with access to Google Compute Engine
Prometheus installed (either on GCE or with network access to your GCE instances)
Proper IAM permissions to allow Prometheus to access GCE API
Basic understanding of Prometheus configuration

How GCE Service Discovery Works

Prometheus uses the GCP API to discover instances across your GCE projects. Here's how the process works:

The discovery process involves:

Prometheus making API calls to Google Cloud
Retrieving metadata about running instances
Creating target endpoints based on discovered instances
Applying relabeling rules to organize and filter targets
Scraping metrics from the discovered endpoints

Configuring GCE Service Discovery

Let's look at how to configure Prometheus to discover and scrape GCE instances:

Authentication Setup

First, Prometheus needs to authenticate with Google Cloud. There are several methods:

Using Google Application Default Credentials:
- When Prometheus runs on GCE, it can use the instance's service account
- Configure the service account with appropriate IAM roles (compute.viewer at minimum)
Using a Service Account Key File:
- Create a service account with the necessary permissions
- Download the key file and reference it in your Prometheus config

Basic Configuration

Add the following to your prometheus.yml file to enable GCE service discovery:

scrape_configs:
  - job_name: 'gce-instances'
    gce_sd_configs:
      - project: 'your-gcp-project-id'
        zone: 'us-central1-a'
        port: 9100  # Default port for node_exporter

This configuration will:

Discover all GCE instances in the specified project and zone
Assume each instance has a metrics endpoint on port 9100 (typical for node_exporter)

Multi-Zone and Multi-Project Configuration

To monitor instances across multiple zones or projects:

scrape_configs:
  - job_name: 'gce-instances'
    gce_sd_configs:
      - project: 'project-1'
        zones: 
          - 'us-central1-a'
          - 'us-central1-b'
        port: 9100
      - project: 'project-2'
        zones:
          - 'us-west1-a'
        port: 9100

Authentication with a Key File

If you're running Prometheus outside of GCP or need to use a specific service account:

scrape_configs:
  - job_name: 'gce-instances'
    gce_sd_configs:
      - project: 'your-gcp-project-id'
        zone: 'us-central1-a'
        port: 9100
        credentials_file: '/path/to/service-account-key.json'

Available Labels and Relabeling

GCE service discovery automatically attaches several metadata labels to discovered targets. These include:

__meta_gce_instance_id: The numeric instance ID
__meta_gce_instance_name: The user-defined instance name
__meta_gce_machine_type: The machine type of the instance
__meta_gce_metadata_NAME: Each metadata item becomes a label
__meta_gce_network: The network of the instance
__meta_gce_private_ip: The private IP address
__meta_gce_project: The GCP project
__meta_gce_public_ip: The public IP address (if available)
__meta_gce_tags: Instance tags (comma-separated)
__meta_gce_zone: The zone of the instance

Using Relabeling to Filter Instances

You can use relabeling to filter which instances Prometheus monitors:

scrape_configs:
  - job_name: 'gce-web-servers'
    gce_sd_configs:
      - project: 'your-gcp-project-id'
        zone: 'us-central1-a'
        port: 9100
    relabel_configs:
      - source_labels: [__meta_gce_tags]
        regex: '.*,web-server,.*'
        action: keep

This configuration will only keep targets that have the web-server tag.

Creating Meaningful Labels from Metadata

You can transform GCE metadata into Prometheus labels:

scrape_configs:
  - job_name: 'gce-instances'
    gce_sd_configs:
      - project: 'your-gcp-project-id'
        zone: 'us-central1-a'
        port: 9100
    relabel_configs:
      - source_labels: [__meta_gce_instance_name]
        target_label: instance
      - source_labels: [__meta_gce_metadata_environment]
        target_label: environment
      - source_labels: [__meta_gce_zone]
        target_label: zone

This configuration extracts the instance name, a custom metadata field called "environment", and the zone into Prometheus labels.

Practical Example: Complete Configuration

Here's a more comprehensive example that:

Discovers instances across multiple zones
Filters based on an "app" metadata property
Applies custom labeling
Adjusts the scrape interval for these targets

scrape_configs:
  - job_name: 'gce-app-monitoring'
    scrape_interval: 30s
    gce_sd_configs:
      - project: 'production-apps'
        zones:
          - 'us-central1-a'
          - 'us-central1-b'
          - 'us-central1-c'
        port: 9100
    relabel_configs:
      # Keep only instances with app=prometheus-monitored metadata
      - source_labels: [__meta_gce_metadata_app]
        regex: 'prometheus-monitored'
        action: keep
      
      # Create an app_component label from a metadata field
      - source_labels: [__meta_gce_metadata_component]
        target_label: app_component
      
      # Create an instance label from instance name
      - source_labels: [__meta_gce_instance_name]
        target_label: instance
      
      # Create a zone label
      - source_labels: [__meta_gce_zone]
        target_label: zone
        
      # If the instance has a custom metrics port defined in metadata, use it
      - source_labels: [__meta_gce_metadata_metrics_port]
        regex: (.+)
        target_label: __address__
        replacement: ${1}:${__meta_gce_metadata_metrics_port}

Best Practices

When using GCE service discovery with Prometheus, consider the following best practices:

Use Metadata for Configuration:
- Add metadata to your instances to control monitoring behavior
- Example: Add prometheus_port: 9100 as metadata to specify custom ports
Create Instance Groups by Purpose:
- Group similar instances with common tags or metadata
- Makes filtering and relabeling more consistent
Consider API Quotas:
- GCP has API request quotas
- Set appropriate scrape intervals to avoid excessive API calls
Use GCE Private IPs When Possible:
- More secure and avoids public internet traffic
- Use VPC peering or similar if Prometheus is in a different network
Include Zones in Labels:
- Allows you to analyze metrics by zone
- Helpful for identifying zone-specific issues

Troubleshooting GCE Service Discovery

If you're experiencing issues with GCE service discovery, check the following:

1. Authentication Problems

If Prometheus can't discover GCE instances, check:

IAM permissions on the service account
Validity of the credentials file
API access to compute.googleapis.com

2. No Targets Found

If no targets are being discovered:

Verify the project ID and zone are correct
Check if your relabeling rules might be filtering out all targets
Examine Prometheus logs for API errors

3. Service Discovery Works But No Metrics

If instances are discovered but no metrics are collected:

Verify the port configuration
Check if firewalls allow traffic to the metrics port
Ensure the monitoring agent is running on the instances

Example: Setting Up a Complete Monitoring Stack

Let's look at a complete example of setting up a monitoring stack on GCE with auto-discovery:

Create a VM Template with node_exporter:

# Install node_exporter on your instances
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.3.1.linux-amd64.tar.gz
sudo cp node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/bin/
sudo useradd -rs /bin/false node_exporter

# Create a systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Start and enable the service
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

Add Custom Metadata to Instances:

Add these metadata items to your instances:

prometheus-monitored: true
app: web-server (or whatever describes your application)
environment: production (or staging, development, etc.)

Configure Prometheus:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'gce-node-exporters'
    gce_sd_configs:
      - project: 'your-gcp-project-id'
        zones:
          - 'us-central1-a'
          - 'us-central1-b'
        port: 9100
    relabel_configs:
      - source_labels: [__meta_gce_metadata_prometheus_monitored]
        regex: 'true'
        action: keep
      - source_labels: [__meta_gce_metadata_app]
        target_label: app
      - source_labels: [__meta_gce_metadata_environment]
        target_label: environment
      - source_labels: [__meta_gce_instance_name]
        target_label: instance

Create Prometheus Alerts Based on Labels:

groups:
- name: GCE Instance Alerts
  rules:
  - alert: HighCPUUsage
    expr: (1 - avg by(instance, app, environment) (irate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is above 80% for 5 minutes on {{ $labels.instance }} (app: {{ $labels.app }}, environment: {{ $labels.environment }})"

Summary

GCE service discovery is a powerful feature of Prometheus that enables automatic discovery and monitoring of your Google Cloud infrastructure. By leveraging this capability, you can:

Eliminate manual configuration of target endpoints
Automatically adapt your monitoring as instances are created or terminated
Apply sophisticated filtering based on metadata, tags, and other attributes
Extract meaningful labels from GCE metadata
Create a more maintainable and scalable monitoring system

As your Google Cloud infrastructure grows, GCE service discovery becomes increasingly valuable, allowing your monitoring to scale automatically with your environment.

Additional Resources

To deepen your understanding of GCE service discovery with Prometheus, consider exploring these resources:

Exercises

To reinforce your learning, try these practical exercises:

Set up Prometheus with GCE service discovery to monitor your own GCE instances
Create a relabeling configuration that categorizes instances by both zone and machine type
Develop a Grafana dashboard that visualizes metrics grouped by the discovered GCE metadata labels
Configure different scrape intervals for different types of instances based on metadata
Implement a multi-project discovery setup and compare resource usage across projects

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Prerequisites​

How GCE Service Discovery Works​

Configuring GCE Service Discovery​

Authentication Setup​

Basic Configuration​

Multi-Zone and Multi-Project Configuration​

Authentication with a Key File​

Available Labels and Relabeling​

Using Relabeling to Filter Instances​

Creating Meaningful Labels from Metadata​

Practical Example: Complete Configuration​

Best Practices​

Troubleshooting GCE Service Discovery​

1. Authentication Problems​

2. No Targets Found​

3. Service Discovery Works But No Metrics​

Example: Setting Up a Complete Monitoring Stack​

Summary​

Additional Resources​

Exercises​