RabbitMQ Production Checklist
Introduction
When moving a RabbitMQ deployment from development to production, there are numerous considerations that can make the difference between a stable, performant system and one that causes headaches for your team and users. This checklist covers essential aspects you should address before and during your RabbitMQ production deployment.
RabbitMQ is a powerful message broker that implements the Advanced Message Queuing Protocol (AMQP). While it's relatively easy to set up for development, a production environment requires careful planning around high availability, monitoring, resource allocation, and security. This guide will walk you through the key considerations to ensure your RabbitMQ deployment is production-ready.
Deployment Planning
Hardware Requirements
Before deploying RabbitMQ to production, ensure your hardware meets the following minimum requirements:
- CPU: RabbitMQ benefits from multiple cores. For production, allocate at least 4 cores.
- Memory: Allocate at least 8GB of RAM for moderate workloads, more for high-volume systems.
- Disk: Use SSDs for optimal performance, especially for the message store.
- Network: Ensure low-latency, high-bandwidth connections between nodes.
Cluster Architecture
// Example cluster config in rabbitmq.conf
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
cluster_formation.classic_config.nodes.1 = rabbit@node1.example.com
cluster_formation.classic_config.nodes.2 = rabbit@node2.example.com
cluster_formation.classic_config.nodes.3 = rabbit@node3.example.com
For production, consider:
- Minimum of three nodes for proper quorum in a clustered setup
- Network partitioning strategies (choose between pause_minority, autoheal, or ignore)
- Distribution across availability zones for cloud deployments
- Load balancer configuration for client connections
Configuration Optimization
Resource Limits
// Example resource limits in rabbitmq.conf
vm_memory_high_watermark.relative = 0.7
vm_memory_high_watermark_paging_ratio = 0.8
disk_free_limit.absolute = 2GB
- Set appropriate memory high watermark (typically 0.4-0.7 of system RAM)
- Configure disk free space limits to prevent the node from running out of disk space
- Adjust file descriptor limits based on expected connection count
Queue Settings
// Declaring a queue with production settings
channel.queueDeclare(
"critical_orders", // queue name
true, // durable
false, // not exclusive
false, // not auto-delete
Map.of(
"x-max-length", 100000,
"x-overflow", "reject-publish",
"x-queue-type", "quorum"
) // arguments
);
- Use durable queues for important messages (survives broker restart)
- Set appropriate queue limits to prevent unbounded growth
- Consider quorum queues for critical data that needs replication
- Use lazy queues for very large queues with infrequent access
Exchange and Binding Configuration
// Setting up exchanges with production settings
channel.exchangeDeclare("orders", "topic", true, false, null);
// Creating bindings
channel.queueBind("critical_orders", "orders", "order.critical.*");
channel.queueBind("regular_orders", "orders", "order.regular.*");
- Use appropriate exchange types for your messaging patterns (direct, topic, fanout, headers)
- Set up specific binding patterns to route messages efficiently
- Consider alternate exchanges for handling unroutable messages
Security Hardening
Authentication and Authorization
// rabbitmqctl commands for setting up users and permissions
// rabbitmqctl add_user prod_app strong_password
// rabbitmqctl set_permissions -p / prod_app "^prod_app-.*" "^prod_app-.*" "^(prod_app-.*|amq\.default)"
- Create dedicated service accounts with limited permissions
- Remove the default guest user or restrict it to localhost only
- Set up vhosts to isolate different applications or environments
- Enable TLS for client connections and management interface
Network Security
- Use firewall rules to restrict access to RabbitMQ ports
- Set up VLAN or subnet isolation for cluster communication
- Consider VPN connections for clients in different networks
- Implement connection rate limiting to prevent DoS attacks
Monitoring and Alerting
Key Metrics to Monitor
// Sample Prometheus configuration for RabbitMQ monitoring
scrape_configs:
- job_name: rabbitmq
static_configs:
- targets: ['rabbitmq:15692']
RabbitMQ exposes many metrics, but focus on these critical ones:
- Queue length - Alert on abnormal growth patterns
- Consumer utilization - Below 80% may indicate processing issues
- Memory usage - Watch for approaching watermark
- Disk space - Alert well before hitting disk limits
- Message rates - Monitor for unusual spikes or drops
- Connection count - Unexpected changes may indicate issues
Tools and Integration
- Set up the Prometheus plugin for metrics collection
- Create Grafana dashboards with relevant visualizations
- Configure alerts for critical thresholds
- Set up log aggregation with ELK Stack or similar tools
High Availability Setup
Clustering
# On the second node, join the first node
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@first-node
rabbitmqctl start_app
# Verify cluster status
rabbitmqctl cluster_status
- Deploy a minimum of three nodes for proper quorum
- Use durable queues and persistent messages for critical data
- Configure proper network partition handling
- Implement quorum queues for replicated message storage
Load Balancing
# HAProxy example configuration for RabbitMQ
frontend rabbitmq_front
bind *:5672
mode tcp
default_backend rabbitmq_back
backend rabbitmq_back
mode tcp
balance roundrobin
server rabbit1 rabbit1.example.com:5672 check
server rabbit2 rabbit2.example.com:5672 check
server rabbit3 rabbit3.example.com:5672 check
- Set up HAProxy or similar load balancer for client connections
- Configure health checks to detect and route around failed nodes
- Consider sticky sessions for certain connection types
Backup and Disaster Recovery
- Implement regular backup procedures for definitions (exchanges, queues, bindings)
- Set up automated snapshots for the Mnesia database
- Create a disaster recovery plan with defined RPO/RTO
- Regularly test the disaster recovery process
# Example backup command for definitions
rabbitmqctl export_definitions /path/to/backup/definitions.json
# Example command to restore definitions
rabbitmqctl import_definitions /path/to/backup/definitions.json
Performance Tuning
Publisher Optimizations
// Example of batching publishes with confirms
channel.confirmSelect();
for (int i = 0; i < 1000; i++) {
channel.basicPublish("orders", "order.regular.1", null, messageBody);
}
channel.waitForConfirmsOrDie(5000);
- Use publisher confirms for guaranteed delivery
- Implement batching for high-throughput scenarios
- Consider asynchronous publishing patterns for better performance
- Set appropriate message persistence levels for your use case
Consumer Optimizations
// Example of setting prefetch count for fair dispatch
channel.basicQos(100, true);
// Setting up a consumer
channel.basicConsume("orders", false, (consumerTag, delivery) -> {
try {
// Process message
processOrder(delivery.getBody());
// Acknowledge successful processing
channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
} catch (Exception e) {
// Reject and requeue on error
channel.basicNack(delivery.getEnvelope().getDeliveryTag(), false, true);
}
}, consumerTag -> { });
- Set appropriate prefetch counts based on consumer processing speed
- Implement proper message acknowledgment patterns
- Use multiple consumers for parallel processing
- Consider consumer-side batching for efficiency
Connection Management
// Connection factory configuration
ConnectionFactory factory = new ConnectionFactory();
factory.setHost("rabbitmq.example.com");
factory.setVirtualHost("/");
factory.setUsername("prod_app");
factory.setPassword("strong_password");
factory.setAutomaticRecoveryEnabled(true);
factory.setNetworkRecoveryInterval(10000);
// Connection sharing
Connection connection = factory.newConnection();
Channel channel1 = connection.createChannel();
Channel channel2 = connection.createChannel();
- Share connections across threads, but use separate channels
- Implement automatic connection recovery
- Set appropriate heartbeat intervals (typically 30-60 seconds)
- Implement proper connection closure on application shutdown
Production Deployment Checklist
Use this final checklist before launching your production RabbitMQ deployment:
✅ Cluster Configuration
- At least 3 nodes for HA setup
- Network partition handling strategy defined
- Node resource limits configured
✅ Security
- Default guest user removed or restricted
- TLS enabled for connections
- Proper user permissions and vhosts configured
- Firewall rules in place
✅ Monitoring
- Prometheus/Grafana dashboards set up
- Alerts configured for critical metrics
- Log aggregation in place
- Regular health checks implemented
✅ Data Safety
- Persistent messages for critical data
- Regular definition backups scheduled
- Disaster recovery plan documented and tested
✅ Client Configuration
- Connection recovery implemented in clients
- Publisher confirms for important messages
- Proper consumer acknowledgment patterns
- Error handling and dead letter exchanges configured
Troubleshooting Common Issues
High CPU Usage
- Check for long queue backlogs causing high message processing load
- Look for poorly optimized queue bindings with complex routing
- Monitor for high connection churn (frequent connect/disconnect)
Memory Issues
# Show memory breakdown by category
rabbitmqctl status | grep memory
# List queues sorted by memory usage
rabbitmqctl list_queues name memory
- Adjust vm_memory_high_watermark if messages are being blocked too early
- Check for large queues consuming excessive memory
- Monitor connection and channel count for unexpected growth
Unexpected Node Failures
- Check system logs for resource exhaustion (disk, file descriptors)
- Verify network stability between nodes
- Review cluster partition events and configured handling strategy
Summary
Preparing RabbitMQ for production involves careful planning across multiple dimensions: hardware resources, cluster architecture, security, monitoring, high availability, and performance tuning. By following this checklist, you can avoid common pitfalls and deploy a robust messaging system that meets your application's needs.
Remember that a production-ready RabbitMQ deployment is not a "set it and forget it" affair. Regular monitoring, maintenance, and periodic review of your configuration against changing workloads are essential to maintain optimal performance and reliability.
Additional Resources
- RabbitMQ Official Documentation
- RabbitMQ Production Checklist (Official)
- RabbitMQ Monitoring
- Clustering Guide
Practice Exercises
- Set up a three-node RabbitMQ cluster on your development environment and test failover scenarios.
- Configure Prometheus and Grafana for monitoring key RabbitMQ metrics.
- Implement a publisher-consumer application with proper production patterns (confirms, acknowledgments, connection recovery).
- Create a disaster recovery plan for your RabbitMQ deployment and test the recovery process.
- Perform a load test on your RabbitMQ deployment and tune the configuration based on the results.
If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)