Ansible Backup Recovery

Introduction

In modern infrastructure management, ensuring reliable backup and recovery procedures is critical for maintaining business continuity and protecting against data loss. Ansible, a powerful automation tool, offers robust capabilities for creating, managing, and restoring backups across your entire infrastructure. This guide will walk you through implementing effective backup and recovery strategies using Ansible, allowing you to automate these essential processes and reduce the risk of human error.

Why Use Ansible for Backup and Recovery?

Ansible provides several advantages for backup and recovery operations:

Idempotency: Run the same playbook multiple times without causing unexpected side effects
Agentless architecture: No need to install backup agents on target systems
Parallel execution: Run backup operations across multiple systems simultaneously
Version control: Store your backup configurations as code in your repository
Scheduled automation: Combine with cron or other scheduling mechanisms for fully automated backups

Basic Backup Strategy with Ansible

Let's start by creating a simple playbook to back up configuration files on remote servers:

yaml
---
# backup_configs.yml
- name: Backup configuration files
  hosts: webservers
  become: yes
  vars:
    backup_dir: "/var/backups/ansible"
    timestamp: "{{ ansible_date_time.date }}_{{ ansible_date_time.time | replace(':', '-') }}"
  
  tasks:
    - name: Ensure backup directory exists
      file:
        path: "{{ backup_dir }}"
        state: directory
        mode: '0755'
    
    - name: Backup nginx configuration
      copy:
        src: /etc/nginx/nginx.conf
        dest: "{{ backup_dir }}/nginx.conf.{{ timestamp }}"
        remote_src: yes
    
    - name: Backup application config
      copy:
        src: /opt/myapp/config.yml
        dest: "{{ backup_dir }}/myapp_config.{{ timestamp }}"
        remote_src: yes
    
    - name: Create backup archive
      archive:
        path: "{{ backup_dir }}/*.{{ timestamp }}"
        dest: "{{ backup_dir }}/config_backup_{{ timestamp }}.tar.gz"
        format: gz
    
    - name: Clean up temporary files
      file:
        path: "{{ item }}"
        state: absent
      with_fileglob:
        - "{{ backup_dir }}/*.{{ timestamp }}"

In this playbook:

We define a timestamp variable to ensure backup uniqueness
We create a backup directory if it doesn't exist
We copy important configuration files with the timestamp appended
We archive the collected files and clean up temporary files

Database Backup with Ansible

Backing up databases is a common requirement. Here's how to back up a MySQL/MariaDB database:

yaml
---
# backup_database.yml
- name: Backup MySQL databases
  hosts: db_servers
  become: yes
  vars:
    backup_dir: "/var/backups/mysql"
    timestamp: "{{ ansible_date_time.date }}_{{ ansible_date_time.time | replace(':', '-') }}"
    mysql_user: "backup_user"
    mysql_password: "{{ vault_mysql_backup_password }}"  # Use Ansible Vault to store passwords securely
  
  tasks:
    - name: Ensure backup directory exists
      file:
        path: "{{ backup_dir }}"
        state: directory
        mode: '0755'
    
    - name: Get list of databases
      command: mysql -u {{ mysql_user }} -p{{ mysql_password }} -e "SHOW DATABASES;"
      register: database_list
      changed_when: false
      no_log: true
    
    - name: Parse database list
      set_fact:
        databases: "{{ database_list.stdout_lines | select('match', '^[a-zA-Z]') | list | difference(['information_schema', 'performance_schema', 'mysql', 'sys']) }}"
    
    - name: Backup each database
      command: mysqldump -u {{ mysql_user }} -p{{ mysql_password }} --single-transaction --quick --lock-tables=false {{ item }} | gzip > {{ backup_dir }}/{{ item }}_{{ timestamp }}.sql.gz
      with_items: "{{ databases }}"
      no_log: true
    
    - name: Set proper permissions on backup files
      file:
        path: "{{ backup_dir }}"
        mode: '0600'
        recurse: yes

This playbook:

Discovers available databases automatically
Excludes system databases
Creates compressed SQL dumps of each database
Sets proper permissions on backup files

Securing Your Backups

When implementing backup strategies, security is paramount. Let's encrypt our backups using Ansible:

yaml
---
# encrypt_backups.yml
- name: Encrypt backup archives
  hosts: backup_server
  become: yes
  vars:
    backup_dir: "/var/backups"
    gpg_recipient: "[email protected]"  # GPG key to use for encryption
  
  tasks:
    - name: Find backup archives
      find:
        paths: "{{ backup_dir }}"
        patterns: "*.tar.gz,*.sql.gz"
        recurse: yes
        file_type: file
        excludes: "*.gpg"  # Don't include already encrypted files
      register: backup_files
    
    - name: Encrypt backup files
      command: gpg --batch --yes --trust-model always --recipient {{ gpg_recipient }} --encrypt {{ item.path }}
      with_items: "{{ backup_files.files }}"
      register: encryption_result
    
    - name: Remove original files if encryption successful
      file:
        path: "{{ item.path }}"
        state: absent
      with_items: "{{ backup_files.files }}"
      when: encryption_result is succeeded

Implementing a Centralized Backup Strategy

For larger infrastructures, centralizing backups is recommended. Here's how to collect backups from multiple servers:

yaml
---
# centralized_backup.yml
- name: Collect backups from servers
  hosts: backup_server
  become: yes
  vars:
    central_backup_dir: "/mnt/backup_storage"
    timestamp: "{{ ansible_date_time.date }}"
  
  tasks:
    - name: Ensure central backup directory exists
      file:
        path: "{{ central_backup_dir }}/{{ timestamp }}"
        state: directory
        mode: '0755'
    
    - name: Fetch backups from web servers
      synchronize:
        src: "/var/backups/ansible/"
        dest: "{{ central_backup_dir }}/{{ timestamp }}/webservers/"
        mode: pull
      delegate_to: "{{ inventory_hostname }}"
      with_items: "{{ groups['webservers'] }}"
    
    - name: Fetch backups from database servers
      synchronize:
        src: "/var/backups/mysql/"
        dest: "{{ central_backup_dir }}/{{ timestamp }}/databases/"
        mode: pull
      delegate_to: "{{ inventory_hostname }}"
      with_items: "{{ groups['db_servers'] }}"
    
    - name: Create backup manifest
      template:
        src: templates/backup_manifest.j2
        dest: "{{ central_backup_dir }}/{{ timestamp }}/manifest.yml"

Automating Backup Rotation and Retention

Managing backup retention is crucial to prevent running out of storage. Here's how to implement a backup rotation policy:

yaml
---
# backup_rotation.yml
- name: Implement backup rotation
  hosts: backup_server
  become: yes
  vars:
    backup_dir: "/mnt/backup_storage"
    daily_retention: 7  # Keep daily backups for 7 days
    weekly_retention: 4  # Keep weekly backups for 4 weeks
    monthly_retention: 6  # Keep monthly backups for 6 months
  
  tasks:
    - name: Find daily backups older than retention period
      find:
        paths: "{{ backup_dir }}"
        patterns: "daily_*"
        age: "{{ daily_retention }}d"
        recurse: no
        file_type: directory
      register: old_daily_backups
    
    - name: Remove expired daily backups
      file:
        path: "{{ item.path }}"
        state: absent
      with_items: "{{ old_daily_backups.files }}"
    
    - name: Find weekly backups older than retention period
      find:
        paths: "{{ backup_dir }}"
        patterns: "weekly_*"
        age: "{{ weekly_retention * 7 }}d"
        recurse: no
        file_type: directory
      register: old_weekly_backups
    
    - name: Remove expired weekly backups
      file:
        path: "{{ item.path }}"
        state: absent
      with_items: "{{ old_weekly_backups.files }}"
    
    - name: Find monthly backups older than retention period
      find:
        paths: "{{ backup_dir }}"
        patterns: "monthly_*"
        age: "{{ monthly_retention * 30 }}d"
        recurse: no
        file_type: directory
      register: old_monthly_backups
    
    - name: Remove expired monthly backups
      file:
        path: "{{ item.path }}"
        state: absent
      with_items: "{{ old_monthly_backups.files }}"

Testing Backup Recovery

Creating backups is only half the solution—you need to verify they can be restored. Let's create a playbook for testing recovery:

yaml
---
# test_recovery.yml
- name: Test backup recovery
  hosts: test_recovery_server
  become: yes
  vars:
    backup_dir: "/mnt/backup_storage/latest"
    restore_dir: "/opt/restore_test"
    test_database: "test_recovery_db"
    mysql_user: "restore_user"
    mysql_password: "{{ vault_mysql_restore_password }}"
  
  tasks:
    - name: Ensure restore directory exists
      file:
        path: "{{ restore_dir }}"
        state: directory
        mode: '0755'
    
    - name: Extract configuration backup
      unarchive:
        src: "{{ backup_dir }}/webservers/config_backup_latest.tar.gz"
        dest: "{{ restore_dir }}/configs"
        remote_src: yes
    
    - name: Create test database
      mysql_db:
        name: "{{ test_database }}"
        state: present
        login_user: "{{ mysql_user }}"
        login_password: "{{ mysql_password }}"
    
    - name: Restore database from backup
      mysql_db:
        name: "{{ test_database }}"
        state: import
        target: "{{ backup_dir }}/databases/customer_db_latest.sql.gz"
        login_user: "{{ mysql_user }}"
        login_password: "{{ mysql_password }}"
    
    - name: Verify database recovery
      command: mysql -u {{ mysql_user }} -p{{ mysql_password }} -e "SELECT COUNT(*) FROM users;" {{ test_database }}
      register: db_test_result
      changed_when: false
      no_log: true
    
    - name: Display test results
      debug:
        msg: "Recovery test completed. Database has {{ db_test_result.stdout_lines[-1] }} user records."

Implementing a Complete Workflow

Putting everything together, let's create a diagram of a complete backup and recovery workflow:

Real-World Example: Complete Disaster Recovery

Let's look at a comprehensive disaster recovery playbook:

yaml
---
# disaster_recovery.yml
- name: Disaster Recovery Procedure
  hosts: all
  become: yes
  vars:
    backup_source: "/mnt/backup_storage/latest"
  
  tasks:
    - name: Check server status
      ping:
      register: connectivity
      ignore_errors: yes
      
    - name: Create server groups based on status
      group_by:
        key: "{{ 'online_servers' if connectivity is succeeded else 'offline_servers' }}"

- name: Restore offline servers
  hosts: offline_servers
  become: yes
  gather_facts: no
  
  tasks:
    - name: Attempt to power on servers via out-of-band management
      command: ipmitool -I lanplus -H {{ hostvars[inventory_hostname]['ipmi_address'] }} -U {{ ipmi_user }} -P {{ ipmi_password }} power on
      delegate_to: localhost
      when: hostvars[inventory_hostname]['ipmi_address'] is defined
      ignore_errors: yes
      
    - name: Wait for server to come online
      wait_for:
        host: "{{ inventory_hostname }}"
        port: 22
        delay: 30
        timeout: 300
      delegate_to: localhost

- name: Restore systems
  hosts: all
  become: yes
  serial: 5  # Restore 5 servers at a time
  vars:
    backup_source: "/mnt/backup_storage/latest"
    
  tasks:
    - name: Mount recovery media
      mount:
        path: /mnt/recovery
        src: "{{ backup_server }}:/backups/{{ inventory_hostname }}"
        fstype: nfs
        state: mounted
      
    - name: Restore system configuration
      include_role:
        name: system_restore
      
    - name: Restore application stack
      include_role:
        name: application_restore
      
    - name: Validate restoration
      include_role:
        name: validation
      register: validation_result
      
    - name: Report recovery status
      debug:
        msg: "Recovery of {{ inventory_hostname }} {{ 'successful' if validation_result.success else 'failed' }}"
      
    - name: Add to success group
      group_by:
        key: "recovery_success"
      when: validation_result.success|bool
      
    - name: Add to failure group
      group_by:
        key: "recovery_failure"
      when: not validation_result.success|bool

- name: Final recovery report
  hosts: localhost
  gather_facts: no
  tasks:
    - name: Generate recovery report
      template:
        src: templates/recovery_report.j2
        dest: "/var/log/ansible/recovery_{{ ansible_date_time.date }}.log"
      
    - name: Send notification
      mail:
        subject: "Disaster Recovery Operation Completed"
        to: [email protected]
        body: "{{ lookup('file', '/var/log/ansible/recovery_{{ ansible_date_time.date }}.log') }}"
      when: groups['recovery_success'] is defined or groups['recovery_failure'] is defined

Best Practices for Ansible Backup and Recovery

Test regularly: Schedule periodic recovery tests to ensure your backup strategy works.
Use Ansible Vault: Encrypt sensitive data like database credentials using Ansible Vault.
Document thoroughly: Maintain comprehensive documentation of your backup and recovery procedures.
Monitor and alert: Implement monitoring and alerting for backup jobs.
Version control: Keep your Ansible playbooks in version control for change tracking.
Immutable backups: Consider implementing immutable backups that cannot be altered once created.
Geographic redundancy: Store backups in multiple geographic locations.

Monitoring Backup Operations

Monitoring your backup operations is critical. Here's a simple playbook to check backup status:

yaml
---
# monitor_backups.yml
- name: Monitor backup status
  hosts: backup_server
  become: yes
  vars:
    backup_dir: "/mnt/backup_storage"
    alert_email: "[email protected]"
    max_age_hours: 24
  
  tasks:
    - name: Check for recent backups
      find:
        paths: "{{ backup_dir }}"
        file_type: directory
        age: "{{ max_age_hours }}h"
        age_stamp: mtime
        recurse: no
      register: recent_backups
    
    - name: Check backup file integrity
      command: find {{ backup_dir }}/latest -type f -name "*.gz" -exec gunzip -t {} \;
      register: integrity_check
      changed_when: false
      failed_when: false
    
    - name: Send alert if no recent backups
      mail:
        subject: "ALERT: No recent backups found"
        to: "{{ alert_email }}"
        body: "No backups have been created in the last {{ max_age_hours }} hours. Please investigate."
      when: recent_backups.matched == 0
    
    - name: Send alert if integrity check failed
      mail:
        subject: "ALERT: Backup integrity check failed"
        to: "{{ alert_email }}"
        body: "The integrity check for backups has failed. Error: {{ integrity_check.stderr }}"
      when: integrity_check.rc != 0

Summary

Implementing an automated backup and recovery strategy with Ansible provides significant benefits for infrastructure management:

Consistency: Ensure backups are taken consistently across your entire infrastructure
Reliability: Automated processes reduce the risk of human error
Efficiency: Save time and resources through automation
Scalability: Easily adapt your backup strategy as your infrastructure grows
Auditability: Maintain clear records of backup and recovery operations

By following the practices outlined in this guide, you can build a robust backup and recovery system using Ansible that will help protect your infrastructure from data loss and minimize downtime in case of disasters.

Additional Resources

Practice implementing the playbooks in a test environment
Expand the playbooks to include specific backup procedures for your applications
Create custom roles for different types of systems in your infrastructure
Consider integrating with a backup monitoring system for enhanced visibility

Exercises

Modify the database backup playbook to include specific options for backing up PostgreSQL databases.
Create a playbook that tests restoration into a containerized environment.
Implement a pre-backup verification step that checks for available disk space.
Design a role-based structure for organizing your backup playbooks.
Extend the monitoring playbook to send notifications through different channels (Slack, Teams, etc.).

If you spot any mistakes on this website, please let me know at [email protected]. I’d greatly appreciate your feedback! :)

Introduction​

Why Use Ansible for Backup and Recovery?​

Basic Backup Strategy with Ansible​

Database Backup with Ansible​

Securing Your Backups​

Implementing a Centralized Backup Strategy​

Automating Backup Rotation and Retention​

Testing Backup Recovery​

Implementing a Complete Workflow​

Real-World Example: Complete Disaster Recovery​

Best Practices for Ansible Backup and Recovery​

Monitoring Backup Operations​

Summary​

Additional Resources​

Exercises​