MongoDB Monitoring Best Practices

Introduction

Effective monitoring is a critical component of managing any production MongoDB deployment. Without proper monitoring, you might miss performance degradation, resource constraints, or potential failures until they affect your application users. This guide covers essential MongoDB monitoring best practices to help you maintain healthy, performant databases and respond proactively to issues before they become critical.

Monitoring MongoDB involves tracking various metrics related to:

Database performance
Resource utilization
Replication health
Query performance
Security and access patterns
Storage capacity and utilization

By the end of this guide, you'll understand which metrics matter most, how to set up effective monitoring, and how to interpret the collected data to make informed decisions about your MongoDB deployments.

Why MongoDB Monitoring Matters

Before diving into specific metrics and tools, let's understand why monitoring is essential:

Performance Optimization: Identify bottlenecks and inefficient queries
Capacity Planning: Track resource usage trends to plan for scaling
Incident Detection: Catch issues early before they affect users
Security: Identify unusual access patterns or potential breaches
Data Integrity: Ensure replication is functioning correctly
SLA Compliance: Verify your database meets service level agreements

Essential MongoDB Metrics to Monitor

System-Level Metrics

These metrics relate to the underlying hardware and operating system:

1. CPU Usage

High CPU utilization can indicate inefficient queries or indexing issues.

javascript
// List active operations that have been running for more than 5 seconds
db.currentOp(
  { 
    "active": true, 
    "secs_running": { "$gt": 5 } 
  }
)

Output:

json
{
  "inprog": [
    {
      "desc": "conn57",
      "opid": 123456,
      "active": true,
      "secs_running": 10,
      "op": "query",
      "ns": "mydb.mycollection",
      "query": { "status": "pending" },
      "client": "192.168.1.10:12345",
      "locks": { /* lock info */ },
      "waitingForLock": false,
      "numYields": 5,
      "threadId": "0x7f1a9d3cd700"
    }
  ]
}
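
When the result contains many operations, it can help to reduce it to the fields you act on. The following is a minimal sketch, assuming the result has the shape shown above; `summarizeLongOps` and the second sample operation are hypothetical and used only for illustration.

```javascript
// Reduce a db.currentOp()-style result to the long-running operations,
// slowest first. `summarizeLongOps` is a hypothetical helper name.
function summarizeLongOps(currentOpResult, minSeconds) {
  return currentOpResult.inprog
    .filter((op) => op.active && op.secs_running > minSeconds)
    .sort((a, b) => b.secs_running - a.secs_running)
    .map((op) => ({
      opid: op.opid,
      ns: op.ns,
      op: op.op,
      secs_running: op.secs_running,
      waitingForLock: op.waitingForLock,
    }));
}

// Sample input shaped like the output above (second entry is made up).
const sampleCurrentOp = {
  inprog: [
    { opid: 123456, active: true, secs_running: 10, op: "query", ns: "mydb.mycollection", waitingForLock: false },
    { opid: 123457, active: true, secs_running: 2, op: "update", ns: "mydb.other", waitingForLock: false },
  ],
};

console.log(summarizeLongOps(sampleCurrentOp, 5));
```

In the shell, the same helper could be applied to a live result, for example `summarizeLongOps(db.currentOp({ active: true }), 5)`.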

2. Memory Usage

MongoDB performance heavily depends on having sufficient memory for its working set. When memory is constrained, performance can degrade significantly.

Key memory metrics:

Resident memory: Actual physical RAM used
Virtual memory: Total virtual address space allocated to the process
Page faults: Indicate that data is being read from disk instead of memory

3. Disk I/O

MongoDB writes data to disk, so disk performance is crucial:

IOPS (Input/Output Operations Per Second)
Disk latency
Disk queue depth

4. Network Traffic

Monitor:

Network throughput
Connection counts
Network errors

MongoDB-Specific Metrics

1. Operations Counters

MongoDB maintains counters for different operation types:

javascript
// Check operation counters
db.serverStatus().opcounters

Output:

json
{
  "insert": 3245,
  "query": 23456,
  "update": 1234,
  "delete": 345,
  "getmore": 5678,
  "command": 87654
}

These metrics show the number of each operation type since the server started. For monitoring, you should track the rate of change rather than absolute values.
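
One way to turn the cumulative counters into rates is to take two snapshots and divide the difference by the elapsed time. The sketch below assumes the counters are plain numbers; `opcounterRates` is a hypothetical helper, and the second snapshot is invented sample data.

```javascript
// Compute per-second rates from two opcounters snapshots.
// `opcounterRates` is a hypothetical helper, not a MongoDB API.
function opcounterRates(previous, current, elapsedSeconds) {
  const rates = {};
  for (const [name, value] of Object.entries(current)) {
    const delta = value - (previous[name] ?? 0);
    // A negative delta suggests the counters were reset (for example by a
    // restart), so fall back to the current value.
    rates[name] = (delta >= 0 ? delta : value) / elapsedSeconds;
  }
  return rates;
}

// First snapshot matches the output above; the second is made up.
const snapshotA = { insert: 3245, query: 23456, update: 1234, delete: 345, getmore: 5678, command: 87654 };
const snapshotB = { insert: 3845, query: 24656, update: 1294, delete: 345, getmore: 5978, command: 90654 };

console.log(opcounterRates(snapshotA, snapshotB, 60));
```

In practice the two snapshots would come from successive calls to `db.serverStatus().opcounters`, taken a known number of seconds apart.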

2. Connection Statistics

javascript
// Check connection statistics
db.serverStatus().connections

Output:

json
{
  "current": 125,
  "available": 51075,
  "totalCreated": 1234,
  "active": 100,
  "exhaustIsMaster": 0,
  "exhaustHello": 0,
  "awaitingTopologyChanges": 0
}

Keep an eye on connection usage patterns and ensure you're not approaching the connection limit.
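
A rough utilization figure can be derived from the fields above. This sketch assumes that `current + available` approximates the configured connection limit; `connectionUtilization` is a hypothetical helper name.

```javascript
// Estimate the fraction of the connection limit in use.
// Assumption: current + available approximates the configured limit.
function connectionUtilization(connections) {
  const limit = connections.current + connections.available;
  return connections.current / limit;
}

// Values taken from the sample output above.
const sampleConnections = { current: 125, available: 51075 };
const utilization = connectionUtilization(sampleConnections);

console.log(`Connection utilization: ${(utilization * 100).toFixed(2)}%`);
```

A value that climbs steadily toward 1 is a signal to review connection pooling in the application or raise the limit.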

3. Replica Set Health

For deployments with replication, monitor:

javascript
// Check replica set status
rs.status()

Key metrics to watch:

Replication lag: How far behind secondaries are from the primary
Oplog window: How much time is covered by the operation log
Member states: Ensure all members are in the expected state
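
The member-state check can be sketched as a small filter over the `members` array returned by `rs.status()`. The helper name, the set of expected states, and the host names below are assumptions chosen for illustration.

```javascript
// Return replica set members that are unreachable or in an unexpected state.
// `findUnhealthyMembers` is a hypothetical helper; adjust the expected
// states to match your topology.
function findUnhealthyMembers(status) {
  const expectedStates = new Set(["PRIMARY", "SECONDARY", "ARBITER"]);
  return status.members
    .filter((m) => m.health !== 1 || !expectedStates.has(m.stateStr))
    .map((m) => ({ name: m.name, stateStr: m.stateStr, health: m.health }));
}

// Made-up sample shaped like the members array of rs.status().
const sampleStatus = {
  members: [
    { name: "db1.example.net:27017", health: 1, state: 1, stateStr: "PRIMARY" },
    { name: "db2.example.net:27017", health: 1, state: 2, stateStr: "SECONDARY" },
    { name: "db3.example.net:27017", health: 1, state: 3, stateStr: "RECOVERING" },
  ],
};

console.log(findUnhealthyMembers(sampleStatus));
```

In the shell this could be called as `findUnhealthyMembers(rs.status())`.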

4. Query Performance

javascript
// Find the slowest profiled operations in the current database
// (requires profiling to be enabled on that database)
db.system.profile.find(
  { millis: { $gt: 100 } }
).sort(
  { millis: -1 }
).limit(5)

Enable profiling to capture slow queries:

javascript
// Enable profiling for queries slower than 100ms
db.setProfilingLevel(1, { slowms: 100 })

5. Database Storage Metrics

Monitor collection and index sizes:

javascript
// Get statistics for a database
db.stats()

Output:

json
{
  "db": "mydb",
  "collections": 12,
  "views": 0,
  "objects": 45678,
  "avgObjSize": 256.3,
  "dataSize": 11700000,
  "storageSize": 13200000,
  "freeStorageSize": 1500000,
  "indexes": 25,
  "indexSize": 3400000,
  "totalSize": 16600000,
  "scaleFactor": 1,
  "fsUsedSize": 256000000000,
  "fsTotalSize": 1000000000000,
  "ok": 1
}
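
A few ratios derived from this output are often easier to alert on than the raw byte counts. The following is a minimal sketch using the sample values above; `storageSummary` is a hypothetical helper, and it assumes the filesystem and free-storage fields are present in the result.

```javascript
// Derive simple ratios from a db.stats()-style document.
// `storageSummary` is a hypothetical helper name.
function storageSummary(stats) {
  return {
    // Share of the underlying filesystem that is in use
    diskUsedPercent: (stats.fsUsedSize / stats.fsTotalSize) * 100,
    // Index size relative to uncompressed data size
    indexToDataRatio: stats.indexSize / stats.dataSize,
    // Share of allocated storage that is free for reuse
    reusablePercent: (stats.freeStorageSize / stats.storageSize) * 100,
  };
}

// Values taken from the sample output above.
const sampleStats = {
  dataSize: 11700000,
  storageSize: 13200000,
  freeStorageSize: 1500000,
  indexSize: 3400000,
  fsUsedSize: 256000000000,
  fsTotalSize: 1000000000000,
};

const summary = storageSummary(sampleStats);
console.log(`Disk used: ${summary.diskUsedPercent.toFixed(1)}%`);
console.log(`Index/data ratio: ${summary.indexToDataRatio.toFixed(2)}`);
console.log(`Reusable storage: ${summary.reusablePercent.toFixed(1)}%`);
```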

Setting Up Effective Monitoring

1. MongoDB Built-in Tools

MongoDB Compass

MongoDB Compass provides a GUI for monitoring many aspects of your database:

Real-time server statistics
Query performance analysis
Index suggestions
Schema visualization

MongoDB Cloud Manager/Ops Manager

MongoDB's official monitoring solutions provide comprehensive monitoring:

Performance metrics
Real-time alerting
Visualization of cluster statistics
Historical data for trend analysis

2. Third-Party Monitoring Solutions

Many general-purpose monitoring tools can be configured for MongoDB:

Prometheus with MongoDB Exporter: Open-source monitoring and alerting
Grafana: Visualization for MongoDB metrics
Datadog: Cloud monitoring with MongoDB integration
New Relic: Performance monitoring platform
Zabbix/Nagios: Enterprise monitoring solutions

3. Custom Monitoring Scripts

For specific needs, you might write custom monitoring scripts:

javascript
// Example monitoring script to check replication lag
const checkReplicationLag = () => {
  const status = rs.status();
  const primary = status.members.find(m => m.state === 1);
  if (!primary) {
    print("ALERT: No primary found in replica set");
    return;
  }

  status.members.forEach(member => {
    if (member.state === 2) { // Secondary
      const lagSeconds = Math.abs(member.optimeDate.getTime() - primary.optimeDate.getTime()) / 1000;
      print(`Member ${member.name} lag: ${lagSeconds.toFixed(2)} seconds`);
      
      // Alert if lag exceeds threshold
      if (lagSeconds > 60) {
        print(`ALERT: High replication lag on ${member.name}`);
        // Add code to send notification
      }
    }
  });
};

// Run check every minute (assumes a shell that provides setInterval, such as mongosh)
setInterval(checkReplicationLag, 60000);

Implementing Monitoring Best Practices

1. Define Clear Baselines

Before you can identify problems, you need to understand what "normal" looks like:

Collect metrics during typical operations for at least a week
Establish performance patterns across different times of day
Document expected ranges for key metrics
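
Once samples are collected, a baseline can be summarized with a few statistics. This is a sketch only: `computeBaseline` is a hypothetical helper, the percentile uses the nearest-rank method, and the CPU samples are invented.

```javascript
// Summarize collected samples into a simple baseline.
// `computeBaseline` is a hypothetical helper name.
function computeBaseline(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, v) => sum + v, 0) / sorted.length;
  // Nearest-rank 95th percentile
  const p95 = sorted[Math.ceil(0.95 * sorted.length) - 1];
  return { mean, p95, min: sorted[0], max: sorted[sorted.length - 1] };
}

// Made-up CPU utilization samples (percent).
const cpuSamples = [35, 40, 38, 42, 90, 37, 41, 39, 36, 42];

console.log(computeBaseline(cpuSamples));
```

Computing a separate baseline per time-of-day bucket would capture the daily patterns mentioned above; with only a handful of samples the 95th percentile is simply the maximum, so longer collection periods give more useful ranges.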

2. Set Appropriate Alerting Thresholds

Not all metrics require the same level of alerting:

Critical alerts: Immediate response required (e.g., replica set primary down)
Warning alerts: Needs attention soon (e.g., increasing replication lag)
Informational alerts: For capacity planning (e.g., disk space trending up)

Example alerting thresholds:

Metric	Warning Threshold	Critical Threshold
CPU Usage	`>70%` for 5 minutes	`>90%` for 5 minutes
Memory Usage	`>80%`	`>95%`
Replication Lag	`>60` seconds	`>300` seconds
Connections	`>70%` of limit	`>90%` of limit
Disk Space	`<25%` free	`<10%` free
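
The table above could be encoded as data and evaluated by a small function. This is a sketch under assumptions: the metric keys and `classifyMetric` are hypothetical names, and the "for 5 minutes" duration condition for CPU is not modeled here.

```javascript
// Thresholds from the table above. `direction` says whether higher or
// lower values are worse. Metric keys are hypothetical names.
const alertThresholds = {
  cpuPercent: { warning: 70, critical: 90, direction: "above" },
  memoryPercent: { warning: 80, critical: 95, direction: "above" },
  replicationLagSeconds: { warning: 60, critical: 300, direction: "above" },
  connectionPercent: { warning: 70, critical: 90, direction: "above" },
  diskFreePercent: { warning: 25, critical: 10, direction: "below" },
};

// Classify a single reading as "ok", "warning", or "critical".
function classifyMetric(metric, value) {
  const t = alertThresholds[metric];
  if (!t) throw new Error(`Unknown metric: ${metric}`);
  const breached = (limit) =>
    t.direction === "above" ? value > limit : value < limit;
  if (breached(t.critical)) return "critical";
  if (breached(t.warning)) return "warning";
  return "ok";
}

console.log(classifyMetric("cpuPercent", 75));
console.log(classifyMetric("diskFreePercent", 8));
console.log(classifyMetric("replicationLagSeconds", 30));
```

Keeping thresholds in a data structure rather than in code makes it easier to tune them as baselines change.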

3. Implement a Monitoring Rotation

For larger teams, implement a monitoring rotation where team members take turns being responsible for:

Reviewing monitoring dashboards
Investigating alerts
Documenting patterns or incidents
Suggesting monitoring improvements

4. Create Runbooks for Common Issues

Document procedures for handling common issues identified by monitoring:

Example runbook snippet for handling high CPU usage:

Check for long-running operations:

javascript
db.currentOp({"active": true, "secs_running": {$gt: 10}})

Review slow queries in the profiler:

javascript
db.system.profile.find({ millis: { $gt: 100 } }).sort({ ts: -1 })

Check for missing indexes:

javascript
db.collection.explain("executionStats").find({ /* query condition */ })

If necessary, kill long-running operations:

javascript
db.killOp(opId)

Real-World Monitoring Examples

Example 1: E-Commerce Application Monitoring

An e-commerce platform needs to ensure its MongoDB database can handle traffic spikes during sales events:

Key metrics to monitor:
- Query response times for product searches
- Read/write operation throughput
- Connection pool utilization
- Cache hit ratios

Monitoring strategy:
- Real-time dashboard for current load
- Historical comparisons with previous sales events
- Automatic scaling triggers based on connection counts
- Alerts for query performance degradation

javascript
// Example aggregation over the profiler collection to monitor slow product searches
// (run against the ecommerce database with profiling enabled)
db.system.profile.aggregate([
  {
    $match: {
      ns: "ecommerce.products",
      op: "query",
      millis: { $gt: 100 },
      ts: { $gt: new Date(Date.now() - 3600000) } // Last hour
    }
  },
  {
    $group: {
      _id: null,
      count: { $sum: 1 },
      avgTime: { $avg: "$millis" },
      maxTime: { $max: "$millis" }
    }
  }
])

Example 2: Financial Services Application

A financial application requires high availability and data consistency:

Key metrics to monitor:
- Write concern acknowledgments and timing
- Replication health and lag
- Index usage for financial transaction queries
- Authentication and authorization events

Monitoring strategy:
- Primary focus on replication metrics
- Multiple notification channels for critical alerts
- Geographic distribution visualizations for global deployments
- Compliance reporting for audit requirements

javascript
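// Check cumulative write latency (microseconds) and write operation count
db.serverStatus().opLatencies.writes
// Check time spent waiting for write concern acknowledgment, and write concern timeouts
db.serverStatus().metrics.getLastError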

Troubleshooting Common Issues with Monitoring

Problem: Unexpected High CPU Usage

Monitoring indicators:

Sustained CPU usage above 80%
Increasing query latency

Investigation steps:

javascript
// 1. Check for long-running operations
db.currentOp(
  {
    "active": true,
    "secs_running": { "$gt": 5 }
  }
)

// 2. Look for collection scans (missing indexes)
db.system.profile.find(
  {
    "planSummary": /COLLSCAN/,
    "millis": { "$gt": 100 }
  }
).sort({ "millis": -1 })

Common solutions:

Add missing indexes
Optimize query patterns
Separate read and write workloads (for example, by routing suitable reads to secondaries)

Problem: Memory Pressure

Monitoring indicators:

Increasing page faults
Growing virtual memory usage
Working set exceeding available RAM

Investigation steps:

javascript
// Check memory statistics
db.serverStatus().mem

// Check WiredTiger cache usage (an indicator of how well the working set fits in memory)
db.serverStatus().wiredTiger.cache

Common solutions:

Increase available RAM
Limit in-memory sort sizes
Review indexing strategy
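
The WiredTiger cache statistics from the investigation steps can be reduced to two ratios that are easy to track over time. This is a minimal sketch: `cacheRatios` is a hypothetical helper, and the byte values are invented sample data.

```javascript
// Compute fill and dirty ratios from a serverStatus().wiredTiger.cache-style
// document. `cacheRatios` is a hypothetical helper name.
function cacheRatios(cache) {
  const max = cache["maximum bytes configured"];
  return {
    fillRatio: cache["bytes currently in the cache"] / max,
    dirtyRatio: cache["tracked dirty bytes in the cache"] / max,
  };
}

// Made-up sample: 8 GiB cache, 6 GiB in use, 0.5 GiB dirty.
const GiB = 1024 ** 3;
const sampleCache = {
  "maximum bytes configured": 8 * GiB,
  "bytes currently in the cache": 6 * GiB,
  "tracked dirty bytes in the cache": 0.5 * GiB,
};

console.log(cacheRatios(sampleCache));
```

A fill ratio that stays close to 1 alongside rising page faults suggests the working set no longer fits in memory.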

Summary

Effective MongoDB monitoring is a crucial aspect of database management that helps ensure optimal performance, reliability, and security. By monitoring system-level metrics, MongoDB-specific metrics, and implementing appropriate alerting and response procedures, you can maintain healthy database deployments.

Key takeaways:

Establish baselines before setting alert thresholds
Monitor both system-level and MongoDB-specific metrics
Implement proper alerting with appropriate severity levels
Create runbooks for common issues
Regularly review monitoring strategies and refine as needed

Monitoring is not a set-it-and-forget-it task. As your application evolves, your monitoring strategy should adapt to focus on the metrics most relevant to your current architecture and usage patterns.

Additional Resources

Exercises

Basic Monitoring Setup: Configure MongoDB to log slow queries (>100ms) and set up a simple script to analyze the logs.
Alert Configuration: Define appropriate warning and critical thresholds for CPU, memory, connections, and replication lag for a MongoDB replica set.
Dashboard Creation: Using a tool like Grafana or MongoDB Compass, create a dashboard showing key MongoDB performance metrics.
Benchmark Testing: Run a load test on a test MongoDB instance and observe how different metrics change under load.
Incident Response Drill: Simulate a common MongoDB issue (high CPU, replication delay, etc.) and practice following your monitoring and response procedures.
