Kubernetes Cluster Scaling
Introduction
Scaling is one of the most powerful features of Kubernetes. As your applications face varying workloads, the ability to scale resources up or down becomes crucial for maintaining performance while optimizing costs. Kubernetes offers several approaches to scaling, from manually adjusting the number of running pods to sophisticated automatic scaling based on resource utilization metrics.
In this guide, we'll explore different scaling strategies for Kubernetes clusters, how to implement them, and best practices to ensure your applications remain responsive regardless of traffic patterns.
Understanding Scaling Types in Kubernetes
Before diving into implementation, let's understand the different types of scaling available in Kubernetes:
1. Horizontal Pod Autoscaling (HPA)
Horizontal scaling increases or decreases the number of pod replicas in a deployment or replication controller. This is useful when your application can distribute load across multiple instances.
2. Vertical Pod Autoscaling (VPA)
Vertical scaling adjusts the CPU and memory resources allocated to existing pods. This is helpful for applications that can't be easily horizontally scaled.
3. Cluster Autoscaling (CA)
Cluster scaling adjusts the number of nodes in your cluster based on resource requirements. This ensures you have enough infrastructure to run your workloads.
Horizontal Pod Autoscaling
How Horizontal Pod Autoscaling Works
Horizontal Pod Autoscaling automatically scales the number of pods in a deployment, replication controller, or replica set based on observed CPU utilization or custom metrics.
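Under the hood, the HPA control loop periodically compares the observed metric to the target and computes a desired replica count. The algorithm documented for the HPA controller is, roughly:

desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]

For example, if 4 replicas average 80% CPU against a 50% target, the controller asks for ceil(4 × 80/50) = 7 replicas.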
Setting Up an HPA
To use the Horizontal Pod Autoscaler, you need the metrics server installed in your cluster. Most managed Kubernetes services have this pre-installed.
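You can check whether Metrics Server is running, and install it from the project's release manifests if it isn't:

kubectl get deployment metrics-server -n kube-system
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml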
Here's a simple example of creating an HPA:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
This HPA will:
- Target a deployment named `example-app`
- Maintain between 2 and 10 replicas
- Scale based on CPU utilization, targeting 50% average utilization
You can apply this configuration using:
kubectl apply -f hpa.yaml
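Alternatively, for a simple CPU-based HPA like this one, kubectl can create a roughly equivalent autoscaler imperatively:

kubectl autoscale deployment example-app --cpu-percent=50 --min=2 --max=10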
Checking HPA Status
You can check the status of your HPAs using:
kubectl get hpa
Example output:
NAME          REFERENCE                TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
example-app   Deployment/example-app   85%/50%   2         10        6          2m
In this example, the current CPU utilization is 85%, causing the HPA to scale up to 6 replicas.
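For more detail, including recent scaling events and the conditions the controller evaluated, describe the HPA:

kubectl describe hpa example-app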
Scaling Based on Custom Metrics
You can also scale based on custom metrics using Prometheus and the Prometheus Adapter. Here's an example HPA that scales based on requests per second:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 100
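Before relying on an HPA like this, it's worth confirming that the metric is actually being served. Assuming the Prometheus Adapter is registered as the custom metrics API, you can query the API group directly:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"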
Vertical Pod Autoscaling
How Vertical Pod Autoscaling Works
Vertical Pod Autoscaling (VPA) automatically adjusts the CPU and memory resource requests and limits of containers in pods. This is useful for applications that can't be horizontally scaled. Note that VPA and HPA should not both scale the same workload on the same metric (CPU or memory), or they will fight each other's adjustments.
Setting Up VPA
To use VPA, you first need to install the Vertical Pod Autoscaler operator:
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler/
./hack/vpa-up.sh
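The script deploys the VPA recommender, updater, and admission controller into kube-system; a quick sanity check that they came up:

kubectl get pods -n kube-system | grep vpa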
Here's an example VPA configuration:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: example-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 100m
        memory: 50Mi
      maxAllowed:
        cpu: 1
        memory: 500Mi
This VPA will:
- Target a deployment named `example-app`
- Automatically update resource requirements
- Set min/max boundaries for CPU and memory
VPA Modes
VPA supports three update modes:
- `Off`: VPA only provides recommendations and doesn't apply them
- `Initial`: VPA applies recommendations only at pod creation
- `Auto`: VPA automatically updates resource requirements for running pods
Checking VPA Status
You can check VPA recommendations with:
kubectl describe vpa example-app-vpa
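The recommendation section of the output looks something like this (the values here are purely illustrative):

Recommendation:
  Container Recommendations:
    Container Name:  example-app
    Lower Bound:
      Cpu:     100m
      Memory:  128Mi
    Target:
      Cpu:     250m
      Memory:  256Mi
    Upper Bound:
      Cpu:     1
      Memory:  500Mi

The Target values are what VPA would set as requests; Lower Bound and Upper Bound delimit the range it considers reasonable.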
Cluster Autoscaling
How Cluster Autoscaling Works
Cluster Autoscaler adjusts the size of your Kubernetes cluster when:
- There are pods that failed to run due to insufficient resources
- There are nodes that have been underutilized for a significant amount of time
Setting Up Cluster Autoscaler
Setting up Cluster Autoscaler varies by cloud provider. Here's an example for Google Kubernetes Engine (GKE):
gcloud container clusters update example-cluster \
--enable-autoscaling \
--min-nodes=3 \
--max-nodes=10 \
--zone=us-central1-a
For AWS EKS, you would typically deploy the autoscaler as a deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.23.0
        name: cluster-autoscaler
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<YOUR-CLUSTER-NAME>
        resources:
          limits:
            cpu: 100m
            memory: 300Mi
          requests:
            cpu: 100m
            memory: 300Mi
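The --node-group-auto-discovery flag above only finds Auto Scaling groups that carry the matching tags, so those tags must be present on the ASG. A sketch with the AWS CLI (the ASG name my-asg is a placeholder):

aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
  "ResourceId=my-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/<YOUR-CLUSTER-NAME>,Value=owned,PropagateAtLaunch=true"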
Node Affinity and Taints for Controlled Scaling
You can control where pods are scheduled using node affinity and taints, which helps the Cluster Autoscaler make better decisions:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: In
                values:
                - compute-optimized
      containers:
      - name: example-app
        image: example/app:latest
        resources:
          requests:
            memory: "256Mi"
            cpu: "500m"
          limits:
            memory: "512Mi"
            cpu: "1000m"
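The affinity above only attracts these pods to compute-optimized nodes; to also keep other workloads off those nodes, taint them and give this deployment a matching toleration. A sketch, where the node name and the node-type taint key are illustrative:

kubectl taint nodes <node-name> node-type=compute-optimized:NoSchedule

Then add a toleration to the pod spec:

tolerations:
- key: "node-type"
  operator: "Equal"
  value: "compute-optimized"
  effect: "NoSchedule"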
Practical Real-World Scaling Scenarios
Scenario 1: Handling Daily Traffic Patterns
Many applications experience predictable traffic patterns. For example, an e-commerce site might see more traffic during evenings and weekends. In this case, you might use HPA in combination with Cluster Autoscaler:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: shopping-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: shopping-app
  minReplicas: 5  # Higher baseline during business hours
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60  # Scale up quickly
    scaleDown:
      stabilizationWindowSeconds: 300  # Scale down slowly
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
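Since the pattern is predictable, some teams also raise minReplicas on a schedule rather than waiting for CPU to climb. One rough sketch is a CronJob that patches the HPA before peak hours (this omits the RBAC the job's service account would need; the kubectl image shown is just one commonly used option):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: shopping-app-scale-up
spec:
  schedule: "0 8 * * *"  # before business hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - kubectl
            - patch
            - hpa
            - shopping-app
            - -p
            - '{"spec":{"minReplicas":10}}'
          restartPolicy: OnFailure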
Scenario 2: Handling Unexpected Traffic Spikes
For applications that might experience sudden, unpredictable traffic spikes (like a news site), you'll want a more aggressive scaling strategy:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: news-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: news-app
  minReplicas: 5
  maxReplicas: 50  # Allow more headroom for unexpected spikes
  behavior:
    scaleUp:
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15  # Double pods every 15 seconds if needed
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50  # More aggressive target
Scenario 3: Batch Processing Jobs
For batch processing workloads, you might want to scale nodes with specific capabilities:
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing-job
spec:
  parallelism: 10
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - c5.2xlarge
      containers:
      - name: data-processor
        image: data-processor:latest
        resources:
          requests:
            cpu: 1
            memory: 4Gi
          limits:
            cpu: 2
            memory: 8Gi
      restartPolicy: Never
Combined with a properly configured Cluster Autoscaler, this will provision the right type of nodes for the job and scale them back when complete.
Best Practices for Kubernetes Scaling
1. Set Resource Requests and Limits
Always set proper resource requests and limits for your containers:
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"
This helps Kubernetes make better scheduling and scaling decisions.
2. Start with Conservative Autoscaling Settings
Begin with conservative settings and adjust based on observed behavior:
- HPA: Start with a higher CPU threshold (70-80%)
- VPA: Use the "Initial" or "Off" update mode first
- Cluster Autoscaler: Set scale-down delay to a higher value (10-15 minutes)
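For the Cluster Autoscaler, the scale-down delay in the last point corresponds to flags you can append to the container command in the Deployment shown earlier:

- --scale-down-delay-after-add=15m
- --scale-down-unneeded-time=15m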
3. Monitor Your Scaling Behavior
Use Kubernetes dashboard, Prometheus, and Grafana to monitor scaling events:
kubectl get events --field-selector reason=ScalingReplicaSet
4. Use Pod Disruption Budgets
Protect your applications during scaling with Pod Disruption Budgets:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app-pdb
spec:
  minAvailable: 2  # or maxUnavailable: 1
  selector:
    matchLabels:
      app: example-app
5. Configure Proper Readiness and Liveness Probes
Ensure Kubernetes knows when your pods are ready to receive traffic:
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
Advanced Topics
Multi-Dimensional Scaling with KEDA
Kubernetes Event-Driven Autoscaling (KEDA) allows scaling based on event sources like message queues:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: rabbitmq-scaledobject
spec:
  scaleTargetRef:
    name: consumer-app
  minReplicaCount: 1
  maxReplicaCount: 30
  triggers:
  - type: rabbitmq
    metadata:
      protocol: amqp
      queueName: orders
      host: rabbitmq.default.svc:5672
      queueLength: "50"
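KEDA itself must be installed in the cluster before ScaledObjects do anything; the Helm chart is the route the KEDA documentation recommends:

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace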
Scaling Based on Custom Business Metrics
You can scale based on business metrics like response time:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: External
    external:
      metric:
        name: nginx_ingress_controller_response_duration_seconds_sum
      target:
        type: AverageValue
        averageValue: 0.1
Summary
Kubernetes offers powerful scaling capabilities to ensure your applications remain responsive and cost-effective:
- Horizontal Pod Autoscaling (HPA) scales the number of pod replicas based on resource utilization or custom metrics.
- Vertical Pod Autoscaling (VPA) adjusts the CPU and memory resources for existing pods.
- Cluster Autoscaling (CA) adds or removes nodes in your cluster based on resource requirements.
By combining these approaches and following best practices, you can build a responsive, self-healing infrastructure that automatically adapts to changing workloads. Remember that effective scaling requires proper monitoring, resource configuration, and an understanding of your application's behavior under load.
Additional Resources
- Kubernetes Official Documentation on HPA
- Kubernetes Autoscaler GitHub Repository
- KEDA - Kubernetes Event-Driven Autoscaling
Exercises
- Set up an HPA for a simple web application and test it by generating load with a tool like `hey` or `siege`.
- Configure Cluster Autoscaler on a test cluster and observe how it reacts when you deploy resource-intensive workloads.
- Implement a VPA for a stateful application and analyze the recommendations it provides.
- Design a scaling strategy for an application with variable workloads throughout the day.
- Set up scaling based on a custom metric from Prometheus, such as request latency or queue length.