Kubernetes Cluster Scaling

Introduction

Scaling is one of the most powerful features of Kubernetes. As your applications face varying workloads, the ability to scale resources up or down becomes crucial for maintaining performance while optimizing costs. Kubernetes offers several approaches to scaling, from manually adjusting the number of running pods to sophisticated automatic scaling based on resource utilization metrics.

In this guide, we'll explore different scaling strategies for Kubernetes clusters, how to implement them, and best practices to ensure your applications remain responsive regardless of traffic patterns.

Understanding Scaling Types in Kubernetes

Before diving into implementation, let's understand the different types of scaling available in Kubernetes:

1. Horizontal Pod Autoscaling (HPA)

Horizontal scaling increases or decreases the number of pod replicas in a deployment or replication controller. This is useful when your application can distribute load across multiple instances.
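
Before any autoscaler is involved, you can scale horizontally by hand. Assuming a Deployment named example-app (the illustrative name used throughout this guide), a one-off adjustment looks like this:

bash
kubectl scale deployment example-app --replicas=5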

2. Vertical Pod Autoscaling (VPA)

Vertical scaling adjusts the CPU and memory resources allocated to existing pods. This is helpful for applications that can't be easily horizontally scaled.

3. Cluster Autoscaling (CA)

Cluster scaling adjusts the number of nodes in your cluster based on resource requirements. This ensures you have enough infrastructure to run your workloads.

Horizontal Pod Autoscaling

How Horizontal Pod Autoscaling Works

Horizontal Pod Autoscaling automatically scales the number of pods in a deployment, replication controller, or replica set based on observed CPU utilization or custom metrics.
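
Under the hood, the controller periodically compares the observed metric to the target and computes desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). For example, 4 replicas averaging 90% CPU against a 50% target yields ceil(4 × 90 / 50) = 8 replicas, subject to the configured minimum and maximum.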

Setting Up an HPA

To use the Horizontal Pod Autoscaler, you need the metrics server installed in your cluster. Most managed Kubernetes services have this pre-installed.
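
If you're not sure whether the metrics server is present, you can check for it and, if it's missing, install the upstream components manifest:

bash
# Check whether the metrics server is deployed
kubectl get deployment metrics-server -n kube-system

# Install it if needed (official components manifest)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml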

Here's a simple example of creating an HPA:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

This HPA will:

  • Target a deployment named example-app
  • Maintain between 2 and 10 replicas
  • Scale based on CPU utilization, targeting 50% average utilization

You can apply this configuration using:

bash
kubectl apply -f hpa.yaml
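
The same autoscaler can also be created imperatively, which is convenient for quick experiments (the YAML form above is easier to keep in version control):

bash
kubectl autoscale deployment example-app --cpu-percent=50 --min=2 --max=10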

Checking HPA Status

You can check the status of your HPAs using:

bash
kubectl get hpa

Example output:

NAME          REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
example-app   Deployment/example-app    85%/50%   2         10        6          2m

In this example, the current CPU utilization is 85%, causing the HPA to scale up to 6 replicas.
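
For more detail, including the scaling events the controller has recorded, describe the HPA:

bash
kubectl describe hpa example-app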

Scaling Based on Custom Metrics

You can also scale based on custom metrics using Prometheus and the Prometheus Adapter. Here's an example HPA that scales based on requests per second:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 100
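
For the HPA to see a pods metric like http_requests_per_second, the Prometheus Adapter needs a rule that derives it from a counter your application already exposes. A minimal sketch, assuming a conventional http_requests_total counter labelled with namespace and pod, might look like this in the adapter's configuration:

yaml
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'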

Vertical Pod Autoscaling

How Vertical Pod Autoscaling Works

Vertical Pod Autoscaling (VPA) automatically adjusts the CPU and memory resource requests and limits of containers in pods. This is useful for applications that can't be horizontally scaled.

Setting Up VPA

To use VPA, you first need to install the Vertical Pod Autoscaler operator:

bash
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler/
./hack/vpa-up.sh

Here's an example VPA configuration:

yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: example-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 100m
        memory: 50Mi
      maxAllowed:
        cpu: 1
        memory: 500Mi

This VPA will:

  • Target a deployment named example-app
  • Automatically update resource requirements
  • Set min/max boundaries for CPU and memory

VPA Modes

VPA supports the following update modes:

  1. Off: VPA only provides recommendations and does not apply them
  2. Initial: VPA applies recommendations only at pod creation
  3. Recreate: VPA evicts and recreates pods to apply updated recommendations
  4. Auto: currently equivalent to Recreate; it may switch to in-place (restart-free) updates once those are generally available

Checking VPA Status

You can check VPA recommendations with:

bash
kubectl describe vpa example-app-vpa

Cluster Autoscaling

How Cluster Autoscaling Works

Cluster Autoscaler adjusts the size of your Kubernetes cluster when:

  • There are pods that failed to run due to insufficient resources
  • There are nodes that have been underutilized for an extended period and whose pods can be rescheduled onto other nodes
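
Pods that the scheduler can't place due to insufficient resources sit in the Pending state, which is exactly what triggers a scale-up; you can list them with:

bash
kubectl get pods --all-namespaces --field-selector=status.phase=Pending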

Setting Up Cluster Autoscaler

Setting up Cluster Autoscaler varies by cloud provider. Here's an example for Google Kubernetes Engine (GKE):

bash
gcloud container clusters update example-cluster \
  --enable-autoscaling \
  --min-nodes=3 \
  --max-nodes=10 \
  --zone=us-central1-a

For AWS EKS, you would typically deploy the autoscaler as a deployment:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.23.0
        name: cluster-autoscaler
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<YOUR-CLUSTER-NAME>
        resources:
          limits:
            cpu: 100m
            memory: 300Mi
          requests:
            cpu: 100m
            memory: 300Mi
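
Once the manifest (together with the service account and RBAC objects it references) is applied, you can confirm the autoscaler is running and follow its decisions in its logs:

bash
kubectl -n kube-system rollout status deployment/cluster-autoscaler
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=50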

Node Affinity and Taints for Controlled Scaling

You can control where pods are scheduled using node affinity and taints, which helps the Cluster Autoscaler make better decisions:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: In
                values:
                - compute-optimized
      containers:
      - name: example-app
        image: example/app:latest
        resources:
          requests:
            memory: "256Mi"
            cpu: "500m"
          limits:
            memory: "512Mi"
            cpu: "1000m"

Practical Real-World Scaling Scenarios

Scenario 1: Handling Daily Traffic Patterns

Many applications experience predictable traffic patterns. For example, an e-commerce site might see more traffic during evenings and weekends. In this case, you might use HPA in combination with Cluster Autoscaler:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: shopping-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: shopping-app
  minReplicas: 5   # Higher baseline during business hours
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60   # Scale up quickly
    scaleDown:
      stabilizationWindowSeconds: 300  # Scale down slowly
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

Scenario 2: Handling Unexpected Traffic Spikes

For applications that might experience sudden, unpredictable traffic spikes (like a news site), you'll want a more aggressive scaling strategy:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: news-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: news-app
  minReplicas: 5
  maxReplicas: 50  # Allow more headroom for unexpected spikes
  behavior:
    scaleUp:
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15  # Double pods every 15 seconds if needed
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50  # More aggressive target

Scenario 3: Batch Processing Jobs

For batch processing workloads, you might want to scale nodes with specific capabilities:

yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing-job
spec:
  parallelism: 10
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - c5.2xlarge
      containers:
      - name: data-processor
        image: data-processor:latest
        resources:
          requests:
            cpu: 1
            memory: 4Gi
          limits:
            cpu: 2
            memory: 8Gi
      restartPolicy: Never

Combined with a properly configured Cluster Autoscaler, this will provision the right type of nodes for the job and scale them back when complete.

Best Practices for Kubernetes Scaling

1. Set Resource Requests and Limits

Always set proper resource requests and limits for your containers:

yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"

This helps Kubernetes make better scheduling and scaling decisions.

2. Start with Conservative Autoscaling Settings

Begin with conservative settings and adjust based on observed behavior:

  • HPA: Start with a higher CPU threshold (70-80%)
  • VPA: Use the "Initial" or "Off" update mode first
  • Cluster Autoscaler: Set scale-down delay to a higher value (10-15 minutes)
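
For the Cluster Autoscaler, the scale-down delay is set through flags on the deployment shown earlier; the values below are illustrative starting points:

yaml
# Appended to the cluster-autoscaler container's command list
- --scale-down-delay-after-add=15m   # wait this long after a scale-up before considering scale-down
- --scale-down-unneeded-time=15m     # a node must be unneeded this long before it is removed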

3. Monitor Your Scaling Behavior

Use the Kubernetes Dashboard, Prometheus, and Grafana to monitor scaling events:

bash
kubectl get events --field-selector reason=ScalingReplicaSet

4. Use Pod Disruption Budgets

Protect your applications during scaling with Pod Disruption Budgets:

yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app-pdb
spec:
  minAvailable: 2  # or maxUnavailable: 1
  selector:
    matchLabels:
      app: example-app

5. Configure Proper Readiness and Liveness Probes

Ensure Kubernetes knows when your pods are ready to receive traffic:

yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20

Advanced Topics

Multi-Dimensional Scaling with KEDA

Kubernetes Event-Driven Autoscaling (KEDA) allows scaling based on event sources like message queues:

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: rabbitmq-scaledobject
spec:
  scaleTargetRef:
    name: consumer-app
  minReplicaCount: 1
  maxReplicaCount: 30
  triggers:
  - type: rabbitmq
    metadata:
      protocol: amqp
      queueName: orders
      host: rabbitmq.default.svc:5672
      queueLength: "50"
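
KEDA itself must be installed in the cluster before ScaledObjects are recognized; the Helm chart published by the KEDA project is the usual route:

bash
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace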

Scaling Based on Custom Business Metrics

You can scale based on business metrics like response time:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: nginx_ingress_controller_response_duration_seconds_sum
      target:
        type: AverageValue
        averageValue: 0.1

Summary

Kubernetes offers powerful scaling capabilities to ensure your applications remain responsive and cost-effective:

  1. Horizontal Pod Autoscaling (HPA) scales the number of pod replicas based on resource utilization or custom metrics.
  2. Vertical Pod Autoscaling (VPA) adjusts the CPU and memory resources for existing pods.
  3. Cluster Autoscaling (CA) adds or removes nodes in your cluster based on resource requirements.

By combining these approaches and following best practices, you can build a responsive, self-healing infrastructure that automatically adapts to changing workloads. Remember that effective scaling requires proper monitoring, resource configuration, and an understanding of your application's behavior under load.

Exercises

  1. Set up an HPA for a simple web application and test it by generating load with a tool like hey or siege (see the example command after this list).
  2. Configure Cluster Autoscaler on a test cluster and observe how it reacts when you deploy resource-intensive workloads.
  3. Implement a VPA for a stateful application and analyze the recommendations it provides.
  4. Design a scaling strategy for an application with variable workloads throughout the day.
  5. Set up scaling based on a custom metric from Prometheus, such as request latency or queue length.
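
For the first exercise, a load-generation run with hey might look like this (the URL is a placeholder for your service's endpoint):

bash
# 2 minutes of load from 50 concurrent workers
hey -z 2m -c 50 http://<your-app-url>/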

