Terraform for Disaster Recovery
Introduction
Disaster Recovery (DR) is a critical aspect of any organization's infrastructure strategy. It involves planning and implementing systems that can quickly recover from catastrophic failures, whether caused by natural disasters, human error, or cyberattacks. In this guide, we'll explore how Terraform can be used to implement effective disaster recovery strategies across your infrastructure.
Terraform's infrastructure-as-code approach provides several advantages for disaster recovery:
- Consistency: Define infrastructure once, deploy it anywhere
- Repeatability: Reliable recreation of environments
- Version Control: Track changes and roll back when needed
- Automation: Reduce recovery time through automated processes
- Multi-cloud Support: Implement DR strategies across different cloud providers
Understanding Disaster Recovery Concepts
Before diving into Terraform implementations, let's understand key disaster recovery metrics:
Recovery Time Objective (RTO)
RTO defines the maximum acceptable time to restore a service after a disaster. For example, a critical payment system might have an RTO of 4 hours, while a marketing website might have an RTO of 24 hours.
Recovery Point Objective (RPO)
RPO defines the maximum acceptable data loss measured in time. For example, an RPO of 1 hour means you can lose up to 1 hour of data during a disaster.
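These targets can also be encoded directly in your Terraform configuration so that they drive concrete settings such as backup frequency. A small illustrative sketch (the tier names and values are assumptions, not prescribed figures):

  locals {
    dr_tiers = {
      critical = { rto_hours = 4, rpo_hours = 1 }   # e.g. payment system
      standard = { rto_hours = 24, rpo_hours = 24 } # e.g. marketing website
    }

    # An RPO of 1 hour implies taking backups at least every hour
    backup_schedule = local.dr_tiers["critical"].rpo_hours <= 1 ? "cron(0 * * * ? *)" : "cron(0 12 * * ? *)"
  }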
Disaster Recovery Strategies
Terraform can help implement various DR strategies:
1. Backup and Restore
The simplest DR strategy involves regular backups that can be restored when needed. While this approach has higher RPO/RTO values, it's cost-effective for non-critical systems.
resource "aws_backup_plan" "example" {
name = "tf-example-backup-plan"
rule {
rule_name = "tf-example-backup-rule"
target_vault_name = aws_backup_vault.test.name
schedule = "cron(0 12 * * ? *)" # Daily backup at 12:00 UTC
lifecycle {
delete_after = 14 # Keep backups for 14 days
}
}
}
resource "aws_backup_selection" "example" {
name = "tf-example-backup-selection"
iam_role_arn = aws_iam_role.example.arn
plan_id = aws_backup_plan.example.id
resources = [
aws_db_instance.example.arn,
aws_ebs_volume.example.arn
]
}
resource "aws_backup_vault" "test" {
name = "tf-example-backup-vault"
}
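The backup selection references an IAM role that AWS Backup assumes when creating backups, which isn't defined in the snippet above. A minimal sketch using the AWS-managed backup policy (the role name is illustrative):

  resource "aws_iam_role" "example" {
    name = "tf-example-backup-role"

    # Allow the AWS Backup service to assume this role
    assume_role_policy = jsonencode({
      Version = "2012-10-17"
      Statement = [{
        Action    = "sts:AssumeRole"
        Effect    = "Allow"
        Principal = { Service = "backup.amazonaws.com" }
      }]
    })
  }

  # AWS-managed policy granting the permissions AWS Backup needs
  resource "aws_iam_role_policy_attachment" "example" {
    role       = aws_iam_role.example.name
    policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForBackup"
  }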
2. Pilot Light
The Pilot Light approach maintains a minimal version of your environment in a secondary region, with critical components like databases kept in sync. During a disaster, you can quickly scale the environment up to full capacity.
# Primary region resources
provider "aws" {
  region = "us-west-2"
  alias  = "primary"
}

# DR region resources
provider "aws" {
  region = "us-east-1"
  alias  = "dr"
}

# Primary database
resource "aws_db_instance" "primary" {
  provider                = aws.primary
  identifier              = "primary-db"
  engine                  = "mysql"
  instance_class          = "db.t3.large"
  allocated_storage       = 100
  backup_retention_period = 7 # Automated backups must be enabled to create replicas
  multi_az                = true
  # Other configuration...
}

# DR database (smaller read replica in the standby region)
resource "aws_db_instance" "dr" {
  provider       = aws.dr
  identifier     = "dr-db"
  instance_class = "db.t3.small"

  # A cross-region replica must reference the source by ARN, not by
  # identifier. Engine and storage settings are inherited from the source.
  replicate_source_db = aws_db_instance.primary.arn
  # Other configuration...
}
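When a real failover occurs, the replica must be promoted to a standalone primary. The AWS provider treats removing replicate_source_db from a managed replica as a promotion, so one option is to gate the link behind a flag. A sketch of how the DR database above could be written, using a hypothetical var.failover_active variable:

  # Hypothetical failover flag; flipping it to true and re-applying
  # promotes the DR replica to a standalone database.
  variable "failover_active" {
    type    = bool
    default = false
  }

  resource "aws_db_instance" "dr" {
    provider       = aws.dr
    identifier     = "dr-db"
    instance_class = "db.t3.small"

    # null removes the replication link, which the provider treats
    # as a promotion of the replica.
    replicate_source_db = var.failover_active ? null : aws_db_instance.primary.arn
  }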
3. Warm Standby
This strategy maintains a scaled-down but fully functional copy of your production environment in the DR region. This reduces RTO but increases costs.
module "primary_region_app" {
source = "./modules/webapp"
providers = {
aws = aws.primary
}
environment = "production"
instance_count = 6
instance_type = "t3.large"
# Other configuration...
}
module "dr_region_app" {
source = "./modules/webapp"
providers = {
aws = aws.dr
}
environment = "dr"
instance_count = 2 # Scaled down but ready
instance_type = "t3.medium"
# Other configuration...
}
4. Hot Standby / Multi-Region Active-Active
The most robust (and costly) approach maintains full production environments in multiple regions, often with active-active configurations. This provides near-zero RTO and RPO.
# Define a module for our application stack. Provider references in a
# module block must be static, so the module is instantiated once per
# region rather than via for_each with an interpolated provider alias.
module "app_stack_primary" {
  source = "./modules/app_stack"

  providers = {
    aws = aws.primary
  }

  region      = "us-west-2"
  environment = "primary"
  is_primary  = true
}

module "app_stack_secondary" {
  source = "./modules/app_stack"

  providers = {
    aws = aws.dr
  }

  region      = "us-east-1"
  environment = "secondary"
  is_primary  = false
}
# Route 53 failover routing (acts as a global load balancer)
resource "aws_route53_record" "www" {
  zone_id        = aws_route53_zone.primary.zone_id
  name           = "www.example.com"
  type           = "A"
  set_identifier = "www-primary" # Required for failover routing

  failover_routing_policy {
    type = "PRIMARY"
  }

  alias {
    name                   = module.app_stack_primary.lb_dns_name
    zone_id                = module.app_stack_primary.lb_zone_id
    evaluate_target_health = true
  }

  health_check_id = aws_route53_health_check.primary.id
}

resource "aws_route53_record" "www_secondary" {
  zone_id        = aws_route53_zone.primary.zone_id
  name           = "www.example.com"
  type           = "A"
  set_identifier = "www-secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = module.app_stack_secondary.lb_dns_name
    zone_id                = module.app_stack_secondary.lb_zone_id
    evaluate_target_health = true
  }
}
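The primary record references a health check that isn't defined above. A minimal sketch, assuming the application exposes a health endpoint (the path is illustrative):

  resource "aws_route53_health_check" "primary" {
    fqdn              = module.app_stack_primary.lb_dns_name
    port              = 443
    type              = "HTTPS"
    resource_path     = "/health" # Illustrative; use your application's health endpoint
    failure_threshold = 3
    request_interval  = 30
  }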
Implementing Cross-Region State Management
For Terraform to be effective in DR scenarios, you need to ensure your Terraform state is available across regions. Using a remote backend with appropriate replication is essential.
terraform {
  backend "s3" {
    bucket         = "terraform-state-dr-example"
    key            = "global/s3/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
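Note that the s3 backend itself has no replication argument; cross-region replication is configured on the state bucket. A sketch using aws_s3_bucket_replication_configuration (the bucket names and replication role are illustrative, and the destination bucket must also have versioning enabled):

  resource "aws_s3_bucket_versioning" "state" {
    bucket = "terraform-state-dr-example"
    versioning_configuration {
      status = "Enabled" # Versioning is required for replication
    }
  }

  resource "aws_s3_bucket_replication_configuration" "state" {
    # Replication requires versioning to be enabled first
    depends_on = [aws_s3_bucket_versioning.state]

    bucket = "terraform-state-dr-example"
    role   = aws_iam_role.replication.arn # Role allowing S3 to replicate objects

    rule {
      id     = "replicate-state-to-dr"
      status = "Enabled"

      filter {} # Replicate all objects

      delete_marker_replication {
        status = "Enabled"
      }

      destination {
        bucket        = "arn:aws:s3:::terraform-state-dr-example-replica"
        storage_class = "STANDARD"
      }
    }
  }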
Creating a DR Testing Plan with Terraform
Regular testing of your DR strategy is crucial. Terraform can help automate these tests:
# Module to create the testing environment
module "dr_test" {
  source = "./modules/dr_test"

  # Only create when testing
  count = var.enable_dr_test ? 1 : 0

  # Pass variables needed for testing
  vpc_id          = module.app_stack_secondary.vpc_id
  subnet_ids      = module.app_stack_secondary.subnet_ids
  security_groups = module.app_stack_secondary.security_groups
}

# Variable to control testing
variable "enable_dr_test" {
  description = "Enable DR testing environment"
  type        = bool
  default     = false
}
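With this in place, a test environment can be created and torn down on demand: run terraform apply -var='enable_dr_test=true' to stand it up, then apply again with the variable back at its default to destroy it, all without touching the standing DR stack.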
Automating Failover with Terraform and Scripts
While Terraform is great for infrastructure setup, you may need additional scripts for failover orchestration. Here's how you can combine them:
# Store failover scripts in S3 (aws_s3_object replaces the deprecated
# aws_s3_bucket_object in AWS provider v4+)
resource "aws_s3_object" "failover_script" {
  bucket = aws_s3_bucket.scripts.id
  key    = "scripts/failover.sh"
  source = "${path.module}/scripts/failover.sh"
  etag   = filemd5("${path.module}/scripts/failover.sh")
}

# Lambda function to execute failover
resource "aws_lambda_function" "dr_failover" {
  function_name = "dr-failover-orchestrator"
  handler       = "index.handler"
  runtime       = "nodejs18.x" # nodejs14.x has been deprecated by Lambda
  filename      = "${path.module}/lambda/failover_function.zip"
  role          = aws_iam_role.lambda_role.arn

  environment {
    variables = {
      SCRIPT_BUCKET  = aws_s3_bucket.scripts.id
      SCRIPT_KEY     = aws_s3_object.failover_script.key
      PRIMARY_REGION = "us-west-2"
      DR_REGION      = "us-east-1"
    }
  }
}
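The function references an execution role that isn't shown above. A minimal sketch granting the basic execution policy (in practice the role would also need Route 53 and S3 permissions for the failover work itself):

  resource "aws_iam_role" "lambda_role" {
    name = "dr-failover-lambda-role"

    # Allow the Lambda service to assume this role
    assume_role_policy = jsonencode({
      Version = "2012-10-17"
      Statement = [{
        Action    = "sts:AssumeRole"
        Effect    = "Allow"
        Principal = { Service = "lambda.amazonaws.com" }
      }]
    })
  }

  # CloudWatch Logs permissions; Route 53 and S3 access would be attached similarly
  resource "aws_iam_role_policy_attachment" "lambda_basic" {
    role       = aws_iam_role.lambda_role.name
    policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
  }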
The failover script (simplified example):
#!/bin/bash
# failover.sh

# 1. Update Route 53 to point www at the DR region's load balancer
aws route53 change-resource-record-sets \
  --hosted-zone-id "$HOSTED_ZONE_ID" \
  --change-batch '{
    "Changes": [
      {
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "www.example.com",
          "Type": "A",
          "SetIdentifier": "failover",
          "Failover": "PRIMARY",
          "AliasTarget": {
            "HostedZoneId": "'"$DR_LB_ZONE_ID"'",
            "DNSName": "'"$DR_LB_DNS_NAME"'",
            "EvaluateTargetHealth": true
          }
        }
      }
    ]
  }'

# 2. Scale up the DR environment
terraform -chdir=/path/to/terraform apply -var 'dr_environment_scale=production' -auto-approve
Real-World Example: Multi-Region Web Application DR
Let's put everything together with a real-world example of a web application with a database backend:
# modules/webapp/main.tf
variable "environment" {
  description = "Environment name"
  type        = string
}

variable "region" {
  description = "AWS region"
  type        = string
}

variable "instance_count" {
  description = "Number of EC2 instances"
  type        = number
}

variable "db_password" {
  description = "Master password for the application database"
  type        = string
  sensitive   = true
}
# VPC and networking
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 3.0"

  name = "${var.environment}-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["${var.region}a", "${var.region}b", "${var.region}c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = var.environment != "production"
}
# Database
resource "aws_db_instance" "main" {
  allocated_storage = 20
  storage_type      = "gp2"
  engine            = "postgres"
  engine_version    = "13.4"
  instance_class    = var.environment == "production" ? "db.t3.large" : "db.t3.medium"

  db_name  = "app_db" # `name` is deprecated in favor of `db_name`
  username = "app_user"
  password = var.db_password

  multi_az                = var.environment == "production"
  backup_retention_period = var.environment == "production" ? 7 : 1
  storage_encrypted       = true

  # copy_tags_to_snapshot is a plain argument, not a block, so no dynamic
  # block is needed. Cross-region snapshot copy for DR is configured
  # separately (e.g. via AWS Backup copy actions).
  copy_tags_to_snapshot = var.environment == "production"
}
# Web application
resource "aws_launch_template" "app" {
  name_prefix   = "${var.environment}-app-"
  image_id      = data.aws_ami.app.id
  instance_type = var.environment == "production" ? "t3.large" : "t3.small"

  user_data = base64encode(templatefile("${path.module}/scripts/user_data.sh", {
    db_endpoint = aws_db_instance.main.endpoint
    region      = var.region
    environment = var.environment
  }))

  iam_instance_profile {
    name = aws_iam_instance_profile.app.name
  }
  # Other configuration...
}
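The launch template references an AMI lookup that isn't shown. A sketch, assuming the application AMI is built in-account and follows a naming convention (the filter value is illustrative):

  data "aws_ami" "app" {
    most_recent = true
    owners      = ["self"]

    filter {
      name   = "name"
      values = ["app-*"] # Illustrative; match your AMI naming convention
    }
  }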
resource "aws_autoscaling_group" "app" {
name = "${var.environment}-app-asg"
vpc_zone_identifier = module.vpc.private_subnets
min_size = var.environment == "dr" ? 1 : var.instance_count
max_size = var.environment == "dr" ? var.instance_count * 2 : var.instance_count * 2
desired_capacity = var.environment == "dr" ? 1 : var.instance_count
launch_template {
id = aws_launch_template.app.id
version = "$Latest"
}
target_group_arns = [aws_lb_target_group.app.arn]
# Other configuration...
}
# Load balancer
resource "aws_lb" "app" {
  name               = "${var.environment}-app-lb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.lb.id]
  subnets            = module.vpc.public_subnets
}
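The autoscaling group above references a target group that isn't shown. A minimal sketch, with an HTTP listener forwarding to it (the port and health check path are illustrative):

  resource "aws_lb_target_group" "app" {
    name     = "${var.environment}-app-tg"
    port     = 80
    protocol = "HTTP"
    vpc_id   = module.vpc.vpc_id

    health_check {
      path = "/health" # Illustrative; use your application's health endpoint
    }
  }

  resource "aws_lb_listener" "app" {
    load_balancer_arn = aws_lb.app.arn
    port              = 80
    protocol          = "HTTP"

    default_action {
      type             = "forward"
      target_group_arn = aws_lb_target_group.app.arn
    }
  }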
# Outputs
output "lb_dns_name" {
  value = aws_lb.app.dns_name
}

output "lb_zone_id" {
  value = aws_lb.app.zone_id
}

output "vpc_id" {
  value = module.vpc.vpc_id
}

output "subnet_ids" {
  value = module.vpc.private_subnets
}
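The DR test module earlier consumed a security_groups output that this module doesn't declare above. A sketch, assuming the load balancer security group is the one to expose:

  output "security_groups" {
    value = [aws_security_group.lb.id]
  }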
Creating a DR Runbook with Terraform
A well-documented runbook is critical for successful DR. Here's an example structure for your Terraform-based DR runbook:
- Prerequisites
  - Terraform installation
  - AWS CLI configuration
  - Required permissions
- Failover Procedure

  # 1. Verify the disaster and decide to fail over
  # 2. Switch to the DR environment
  cd terraform/environments/dr
  # 3. Scale up the DR environment
  terraform apply -var 'app_instance_count=6' -auto-approve
  # 4. Update DNS
  ./scripts/update_route53.sh
  # 5. Verify application functionality