Kubernetes Restore

Introduction

Restoring Kubernetes resources is a critical skill for any Kubernetes administrator. No matter how stable your cluster is, you'll eventually face situations where you need to recover from data loss, configuration errors, or cluster failures. This guide will walk you through the concepts, tools, and techniques for effectively restoring Kubernetes resources from backups.

Understanding Kubernetes Restore

Kubernetes restore is the process of recreating resources and applications in a Kubernetes cluster using previously saved backup data. A proper restore strategy is essential for:

  • Recovering from accidental deletions
  • Migrating workloads to new clusters
  • Rolling back after failed upgrades
  • Handling disaster recovery scenarios
  • Meeting compliance and business continuity requirements

Prerequisites

Before diving into restore operations, make sure you have:

  • Access to your Kubernetes cluster with appropriate permissions
  • Previously created backups of your resources
  • Knowledge of your cluster's configuration
  • Backup/restore tools installed (we'll cover several options)

Basic Restore Concepts

What Can Be Restored?

In Kubernetes, you can restore:

  1. Cluster state - API objects like deployments, services, and configmaps
  2. Application data - Persistent volumes and their contents
  3. Etcd data - The underlying database storing all cluster information
  4. Custom resources - CRDs and their instances

Restore Approaches

Restoring Individual Resources

The simplest form of restore is recreating individual resources from stored manifests.

Using kubectl apply

If you've saved your resource definitions as YAML or JSON files, you can restore them using kubectl apply:

bash
# Restore a single resource
kubectl apply -f deployment-backup.yaml

# Restore multiple resources
kubectl apply -f backup-directory/

# Restore resources from a directory into a specific namespace
# (-n only applies to manifests that don't already set a namespace)
kubectl apply -f namespace-backup/ -n my-namespace

Example output:

deployment.apps/nginx-deployment created
service/nginx-service created
configmap/nginx-config created
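
This approach assumes the manifests were exported before anything was lost. Below is a minimal sketch of capturing resources with kubectl ahead of time (resource and file names are illustrative), plus a server-side dry run you can use to validate a batch of restored manifests before applying them for real:

bash
# Save individual resources as YAML for later restore
kubectl get deployment nginx-deployment -o yaml > deployment-backup.yaml
kubectl get service nginx-service -o yaml > service-backup.yaml

# Save the core objects of a namespace into a single file
kubectl get all,configmap,secret -n my-namespace -o yaml > my-namespace-backup.yaml

# Validate restored manifests against the API server without creating anything
kubectl apply -f backup-directory/ --dry-run=server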

Restoring from etcd Snapshots

For more critical scenarios, you might need to restore the entire etcd database:

bash
# Stop the API server (assuming it runs as a systemd service; on kubeadm
# clusters, move the kube-apiserver manifest out of /etc/kubernetes/manifests instead)
sudo systemctl stop kube-apiserver

# Restore etcd from snapshot (the member name must match an entry in --initial-cluster)
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --name="master-1" \
  --data-dir=/var/lib/etcd-restore \
  --initial-cluster="master-1=https://192.168.1.10:2380" \
  --initial-cluster-token="etcd-cluster" \
  --initial-advertise-peer-urls="https://192.168.1.10:2380"

# Update etcd configuration to use restored data
sudo mv /var/lib/etcd /var/lib/etcd.old
sudo mv /var/lib/etcd-restore /var/lib/etcd

# Restart etcd and API server
sudo systemctl restart etcd
sudo systemctl start kube-apiserver
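
Once etcd and the API server are back up, it's worth confirming cluster health before declaring the restore complete. A minimal check, assuming kubeadm-style certificate paths (adjust the endpoint and paths for your environment):

bash
# Verify the restored etcd member is healthy
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Confirm the API server can read the restored state
kubectl get nodes
kubectl get pods --all-namespaces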

Using Specialized Backup/Restore Tools

Several tools exist specifically for Kubernetes backup and restore operations.

Velero

Velero is one of the most popular Kubernetes backup/restore tools.

Installing Velero

bash
# Install the client
brew install velero # macOS
# OR download the release tarball (Linux)
wget https://github.com/vmware-tanzu/velero/releases/download/v1.11.0/velero-v1.11.0-linux-amd64.tar.gz
tar -xzf velero-v1.11.0-linux-amd64.tar.gz
sudo mv velero-v1.11.0-linux-amd64/velero /usr/local/bin/

# Install Velero in your cluster (AWS example)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.6.0 \
  --bucket velero-backups \
  --backup-location-config region=us-east-1 \
  --secret-file ./credentials-velero
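
Before relying on Velero for restores, it's worth confirming the server components deployed cleanly (Velero installs into the velero namespace by default):

bash
# Check client and server versions
velero version

# Confirm the Velero pods and backup storage location are healthy
kubectl get pods -n velero
kubectl get backupstoragelocations -n velero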

Restoring with Velero

bash
# List available backups
velero backup get

# Restore an entire backup
velero restore create --from-backup my-backup

# Restore specific resources
velero restore create --from-backup my-backup --include-resources deployments,services

# Restore to a different namespace
velero restore create --from-backup my-backup --namespace-mappings source-ns:destination-ns

Example output:

Restore request "my-backup-20230615133021" submitted successfully.
Run `velero restore describe my-backup-20230615133021` or `velero restore logs my-backup-20230615133021` for more details.

You can check the status of your restore:

bash
velero restore get

Output:

NAME                       BACKUP      STATUS      STARTED                         COMPLETED                       ERRORS   WARNINGS   CREATED                         SELECTOR
my-backup-20230615133021   my-backup   Completed   2023-06-15 13:30:21 +0000 UTC   2023-06-15 13:32:38 +0000 UTC   0        2          2023-06-15 13:30:21 +0000 UTC   <none>

Kasten K10

Kasten K10 is an enterprise-focused backup and restore solution with a user-friendly dashboard.

bash
# Install Kasten K10
helm repo add kasten https://charts.kasten.io/
helm repo update
helm install k10 kasten/k10 --namespace=kasten-io --create-namespace
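
Before moving on to restores, it can help to verify that the K10 components came up in the namespace created above:

bash
# Check that the Kasten K10 pods are running
kubectl get pods -n kasten-io

# Review the Helm release status
helm status k10 -n kasten-io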

Restore operations can be performed through the K10 dashboard or via CLI:

bash
# Create a restore action
kubectl create -f restore-action.yaml

Example restore-action.yaml:

yaml
apiVersion: actions.kio.kasten.io/v1alpha1
kind: Restore
metadata:
  name: restore-mysql-app
spec:
  restorePoint:
    backupID: gw3rgbdl
    restorePointID: jx3rgbsq
  options:
    applicationBackupName: "mysql-backup"

Stateful Application Restore

Restoring stateful applications requires extra attention to persistent volumes.

Example: Restoring a PostgreSQL Database

  1. First, create a restore for the PVC using Velero:
bash
velero restore create --from-backup postgres-backup \
--include-resources persistentvolumeclaims,persistentvolumes
  2. Then deploy the PostgreSQL application pointing to the restored volume:
yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: "postgres"
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:14
        volumeMounts:
        - name: postgres-data
          mountPath: /var/lib/postgresql/data
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password
  volumeClaimTemplates:
  - metadata:
      name: postgres-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
  3. Verify the restore:
bash
kubectl exec -it postgres-0 -- psql -U postgres -c "\l"

Best Practices for Kubernetes Restore

  1. Test your restores regularly

    • Don't wait for a disaster to find out if your restore process works
    • Schedule regular restore drills to validate procedures
  2. Document restore procedures

    • Maintain clear, step-by-step instructions
    • Include both automated and manual recovery steps
  3. Use namespaces for isolation

    • Restore to temporary namespaces to verify data before replacing production resources
  4. Consider application consistency

    • Some applications require special handling for consistent restores
    • Database restores may need extra steps beyond volume restoration
  5. Implement restore validation

    • Develop automated tests to verify application functionality after restore (a minimal validation sketch follows this list)
    • Check that applications can connect to their dependencies
  6. Maintain backup/restore version compatibility

    • Ensure your restore tools are compatible with the Kubernetes version you're restoring to
    • Test restores across different versions when upgrading clusters
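
As a starting point for the validation mentioned in point 5, here is a minimal post-restore check script. The namespace, deployment name, service, and health endpoint are placeholders; substitute the names used by your own application:

bash
#!/usr/bin/env bash
# Minimal post-restore validation sketch: adjust names for your application.
set -euo pipefail

NAMESPACE="production-api"      # restored namespace (placeholder)
DEPLOYMENT="nginx-deployment"   # workload to verify (placeholder)

# 1. Wait for the restored workload to become ready
kubectl rollout status deployment/"$DEPLOYMENT" -n "$NAMESPACE" --timeout=120s

# 2. List any pods that are not Running or Succeeded
kubectl get pods -n "$NAMESPACE" \
  --field-selector=status.phase!=Running,status.phase!=Succeeded

# 3. Smoke-test the application from inside the cluster (endpoint is a placeholder)
kubectl run restore-check --rm -i --restart=Never -n "$NAMESPACE" \
  --image=curlimages/curl -- curl -fsS http://nginx-service/healthz

echo "Restore validation checks completed."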

Troubleshooting Common Restore Issues

Incomplete Restores

If resources are missing after a restore:

bash
# Check restore status
velero restore describe my-restore

# Look for resource filtering issues
kubectl get events -n velero

Permission Issues

Restores might fail due to RBAC restrictions:

bash
# Ensure the restore service account has proper permissions
kubectl create clusterrolebinding restore-admin \
--clusterrole=cluster-admin \
--serviceaccount=velero:velero

Volume Restore Problems

If PVs aren't properly restored:

bash
# Check volume status
kubectl get pv,pvc

# Look for storage provider errors
kubectl describe pv <problem-pv>

Real-World Restore Scenarios

Scenario 1: Recovery After Accidental Namespace Deletion

bash
# Restore entire namespace and its contents
velero restore create --from-backup daily-backup-20230610 \
--include-namespaces production-api

Scenario 2: Migrating Applications Between Clusters

bash
# On source cluster
velero backup create migration-backup --include-namespaces app-namespace

# On destination cluster
velero restore create --from-backup migration-backup

Scenario 3: Rollback After Failed Upgrade

bash
# Restore the pre-upgrade state
velero restore create --from-backup pre-upgrade-backup \
  --include-resources deployments,statefulsets,configmaps,secrets \
  --include-namespaces app-namespace

Summary

Kubernetes restore operations are a critical aspect of cluster administration. By understanding the different restore approaches and tools, you can develop effective strategies for various recovery scenarios. Remember to:

  • Regularly test your restore procedures
  • Choose the right tool for your specific needs
  • Consider both resource configuration and persistent data
  • Document your processes for emergency situations

With proper planning and practice, you'll be well-prepared to handle any recovery situation in your Kubernetes environment.

Additional Resources

Exercises

  1. Set up Velero on a test cluster and perform a backup and restore of a simple application.
  2. Create a disaster recovery plan for a stateful application running in Kubernetes.
  3. Practice restoring a database application and verify data consistency after restore.
  4. Simulate an accidental deletion of critical resources and perform a recovery.
  5. Develop a script to automate the verification of a successful restore operation.

