Network Monitoring
Introduction
Network monitoring is a critical aspect of maintaining a healthy IT infrastructure. It involves the continuous observation of network components such as routers, switches, servers, and endpoints to ensure optimal performance, detect anomalies, and troubleshoot issues before they affect users. In this guide, we'll explore how to implement effective network monitoring using Grafana as our visualization platform.
Why Network Monitoring Matters
Networks form the backbone of modern computing environments. When network issues occur, they can impact everything from user productivity to customer experience. Effective network monitoring helps you:
- Detect and resolve issues before they impact end-users
- Establish baseline performance metrics
- Track network utilization and plan capacity
- Identify security threats and anomalies
- Ensure compliance with service level agreements (SLAs)
Network Monitoring Basics
Before diving into implementation, let's understand some fundamental concepts in network monitoring:
Key Network Metrics
Monitoring the right metrics is essential for network visibility:
- Availability: Is the device or service responding?
- Latency: How long does it take for data to travel?
- Packet Loss: Are data packets being dropped?
- Bandwidth Utilization: How much of your available bandwidth is being used?
- Error Rates: How many errors are occurring on interfaces?
- Throughput: How much data is successfully transmitted over time?
The Monitoring Stack
A typical Grafana-based network monitoring stack includes:
- Data Sources: Tools like SNMP exporters, Prometheus Node Exporters, or specialized network monitoring agents
- Storage: Time-series databases like Prometheus, InfluxDB, or Graphite
- Visualization: Grafana dashboards
- Alerting: Grafana or external alerting systems
Implementing Network Monitoring with Grafana
Let's walk through setting up a basic network monitoring solution with Grafana and Prometheus.
Step 1: Set Up Data Collection
The first step is to collect network metrics. For our example, we'll use SNMP (Simple Network Management Protocol) which is widely supported by network devices.
Installing and Configuring the SNMP Exporter
The SNMP Exporter converts SNMP data into a format Prometheus can scrape:
# Download the SNMP exporter
wget https://github.com/prometheus/snmp_exporter/releases/download/v0.20.0/snmp_exporter-0.20.0.linux-amd64.tar.gz
# Extract the files
tar -xzf snmp_exporter-0.20.0.linux-amd64.tar.gz
# Move to the appropriate directory
cd snmp_exporter-0.20.0.linux-amd64/
Create a configuration file snmp.yml
with your network device details:
default:
auth:
community: public
version: 2
retries: 3
timeout: 10s
walk_params:
max_repetitions: 25
metrics:
- name: ifHCInOctets
oid: 1.3.6.1.2.1.31.1.1.1.6
type: counter
- name: ifHCOutOctets
oid: 1.3.6.1.2.1.31.1.1.1.10
type: counter
- name: ifOperStatus
oid: 1.3.6.1.2.1.2.2.1.8
type: gauge
- name: ifAdminStatus
oid: 1.3.6.1.2.1.2.2.1.7
type: gauge
Run the SNMP exporter:
./snmp_exporter --config.file=snmp.yml
Step 2: Configure Prometheus to Scrape Network Metrics
Update your prometheus.yml
file to scrape the SNMP exporter:
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'snmp'
static_configs:
- targets:
- router1:161 # Your network device IP and SNMP port
- switch1:161
metrics_path: /snmp
params:
module: [default]
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: localhost:9116 # SNMP exporter address
Step 3: Create a Grafana Dashboard
Now let's create a dashboard to visualize our network metrics. In Grafana:
- Add Prometheus as a data source
- Create a new dashboard
- Add panels for key metrics
Here's a PromQL query example for network interface throughput:
rate(ifHCInOctets{instance="router1:161",ifIndex="2"}[5m]) * 8
This query shows the input traffic rate in bits per second for interface 2 on router1.
Sample Network Monitoring Dashboard
Below is an example of how to structure your Grafana dashboard for network monitoring:
Dashboard Sections
-
Overview Panels
- Network Map
- Global Status Summary
- Alert Status
-
Device Status
- Uptime
- CPU Utilization
- Memory Usage
-
Interface Metrics
- Throughput (In/Out)
- Packet Loss
- Errors and Discards
-
Latency Metrics
- Round Trip Time (RTT)
- Jitter
- DNS Response Time
Example Dashboard JSON
Here's a snippet of a Grafana dashboard JSON for network monitoring:
{
"panels": [
{
"title": "Interface Throughput",
"type": "graph",
"datasource": "Prometheus",
"targets": [
{
"expr": "rate(ifHCInOctets{instance=~\"$device\",ifIndex=~\"$interface\"}[5m]) * 8",
"legendFormat": "{{instance}} - {{ifDescr}} - In",
"refId": "A"
},
{
"expr": "rate(ifHCOutOctets{instance=~\"$device\",ifIndex=~\"$interface\"}[5m]) * 8",
"legendFormat": "{{instance}} - {{ifDescr}} - Out",
"refId": "B"
}
],
"yaxes": [
{
"format": "bps",
"label": "Throughput"
},
{
"format": "short",
"show": false
}
]
},
{
"title": "Interface Status",
"type": "stat",
"datasource": "Prometheus",
"targets": [
{
"expr": "ifOperStatus{instance=~\"$device\",ifIndex=~\"$interface\"}",
"legendFormat": "{{instance}} - {{ifDescr}}",
"instant": true
}
],
"options": {
"colorMode": "value",
"graphMode": "none",
"justifyMode": "auto",
"mappings": [
{
"type": "value",
"options": {
"1": {
"text": "Up",
"color": "green"
},
"2": {
"text": "Down",
"color": "red"
}
}
}
]
}
}
]
}
Advanced Network Monitoring Techniques
Once you have basic monitoring in place, consider implementing these advanced techniques:
Network Traffic Analysis
Use Grafana to create visualizations that help identify traffic patterns and anomalies:
sum by (instance) (rate(ifHCInOctets[5m]) * 8)
This query shows the total input traffic across all interfaces per device.
Alerting on Network Issues
Configure Grafana alerts to notify you when network metrics exceed thresholds:
- Edit any graph panel
- Go to the "Alert" tab
- Define conditions, for example:
- WHEN last() OF query(A) IS ABOVE 90
- (This alerts when bandwidth utilization exceeds 90%)
Network Health Scoring
Create a composite network health score by combining multiple metrics:
(
(avg_over_time(up{job="snmp"}[5m]) * 100) +
(100 - avg_over_time(rate(ifInErrors[5m])[5m]) * 100) +
(100 - min_over_time((rate(ifHCInOctets[5m]) / 1000000000)[5m]) * 100)
) / 3
This creates a health score (0-100) based on uptime, error rates, and bandwidth utilization.
Network Visualization with Grafana
Grafana offers several visualization options particularly useful for network monitoring:
Topology Maps
Use the Grafana Topology Panel plugin to create interactive network maps:
// Sample topology configuration
{
"nodes": [
{ "id": "router1", "title": "Core Router", "mainStat": "{{value}}", "secondaryStat": "Mbps", "arc__failed": 0.5 },
{ "id": "switch1", "title": "Distribution Switch", "mainStat": "{{value}}", "secondaryStat": "Mbps", "arc__failed": 0.2 }
],
"edges": [
{ "id": "edge1", "source": "router1", "target": "switch1", "mainStat": "{{value}}" }
]
}
Heatmaps for Traffic Patterns
Heatmaps can visualize traffic patterns over time to identify peak usage:
rate(ifHCInOctets{instance=~"$device"}[5m]) * 8
Troubleshooting with Grafana
When network issues occur, Grafana can help identify the root cause:
- Correlate Events: Use Grafana annotations to mark when changes or incidents occurred
- Compare Metrics: Use the built-in comparison feature to compare current metrics with historical data
- Drill Down: Create linked dashboards that allow you to drill down from a high-level view to detailed interface metrics
Common Network Monitoring Patterns
Here are some proven patterns for network monitoring with Grafana:
The RED Pattern
For service-level monitoring:
- Rate: Requests per second
- Errors: Number of failed requests
- Duration: Distribution of response times
The USE Pattern
For resource-level monitoring:
- Utilization: Percentage of time the resource is busy
- Saturation: Amount of work resource has to do (queue)
- Errors: Count of error events
Summary
Effective network monitoring with Grafana provides visibility into your network infrastructure, helping you maintain performance, quickly identify issues, and plan for future capacity needs. By following the patterns and techniques outlined in this guide, you can build comprehensive dashboards that give you both high-level overviews and detailed insights into your network's health.
Additional Resources
- Grafana Official Documentation
- Prometheus SNMP Exporter Documentation
- PromQL Cheat Sheet for Network Monitoring
Exercises
- Set up SNMP monitoring for a local network device and create a basic Grafana dashboard showing interface utilization.
- Create an alert in Grafana that triggers when a network device becomes unreachable.
- Design a dashboard that combines network metrics with application performance to show the correlation between network issues and application behavior.
- Implement the USE pattern for a core network device, creating visualizations for utilization, saturation, and errors.
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)