Kubernetes HorizontalPodAutoscaler
Introduction
The HorizontalPodAutoscaler (HPA) is a Kubernetes resource that automatically scales the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed metrics such as CPU utilization. This helps ensure your application has enough resources to handle varying workloads while optimizing cluster resource usage.
Think of HPA as an automatic adjustment system that:
- Adds more pod instances when your application is under heavy load
- Removes unnecessary pod instances when demand decreases
- Helps maintain performance without manual intervention
In this guide, we'll explore how HPA works, how to set it up, and see real-world examples of its application.
How HorizontalPodAutoscaler Works
HPA operates on a simple feedback loop:
- It continuously monitors the specified metrics (typically CPU or memory usage)
- It calculates the desired number of replicas based on current usage vs target usage
- It adjusts the number of replicas to match the calculated value
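The calculation in that loop follows a simple formula from the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), evaluated by the controller every 15 seconds by default. A quick shell sketch of the arithmetic (the utilization numbers are illustrative, not from a real cluster):

```shell
# HPA formula: desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
current_replicas=3
current_cpu=90   # observed average utilization, percent
target_cpu=50    # target average utilization, percent

# Integer ceiling division: ceil(a / b) == (a + b - 1) / b
desired=$(( (current_replicas * current_cpu + target_cpu - 1) / target_cpu ))
echo "desired replicas: $desired"   # ceil(3 * 90 / 50) = ceil(5.4) = 6
```

Note the ceiling: HPA rounds up, so even a slight overshoot of the target adds a pod rather than leaving the workload under-provisioned.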
Prerequisites
To use HPA effectively, your Kubernetes cluster needs:
- The Kubernetes Metrics Server installed (it supplies resource metrics such as CPU and memory)
- Resource requests defined for your pod containers (utilization is calculated as a percentage of the request)
- A scalable workload controller (such as a Deployment or StatefulSet)
Basic HPA Configuration
Let's start with a simple example. First, we need a deployment with resource requests defined:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: nginx:latest
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
Now, let's create an HPA that will automatically scale this deployment based on CPU usage:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
This HPA configuration:
- Targets our my-app deployment
- Sets a minimum of 1 replica and a maximum of 10 replicas
- Aims to maintain average CPU utilization at 50%
- Will add more pods if CPU usage exceeds 50%
- Will remove pods if CPU usage falls below 50%
Creating and Managing HPA
You can create an HPA using the YAML above with:
kubectl apply -f hpa.yaml
Alternatively, use the kubectl autoscale command:
kubectl autoscale deployment my-app --cpu-percent=50 --min=1 --max=10
Checking HPA Status
To see the current status of your HPA:
kubectl get hpa
Output might look like:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
my-app-hpa Deployment/my-app 7%/50% 1 10 1 5m
For more detailed information:
kubectl describe hpa my-app-hpa
Advanced HPA Configuration
Multiple Metrics
HPA can scale based on multiple metrics simultaneously:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 60
This will scale based on whichever metric requires more replicas.
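Concretely, the controller runs the replica formula once per metric and keeps the largest result. A sketch of that selection, with made-up utilization numbers:

```shell
# Desired replicas per metric (HPA ceiling formula), then take the maximum.
replicas=4
cpu_desired=$(( (replicas * 70 + 50 - 1) / 50 ))   # ceil(4 * 70 / 50) = 6
mem_desired=$(( (replicas * 65 + 60 - 1) / 60 ))   # ceil(4 * 65 / 60) = 5

desired=$cpu_desired
if [ "$mem_desired" -gt "$desired" ]; then
  desired=$mem_desired
fi
echo "desired: $desired"   # max(6, 5) = 6
```

Taking the maximum means no single metric is ever left above its target, at the cost of sometimes over-provisioning for the others.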
Custom Metrics
With a custom metrics adapter, you can scale based on application-specific metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: 1000
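With an AverageValue target like this, HPA effectively divides the metric's total across all pods by the per-pod target. A sketch with illustrative traffic numbers (not from a real workload):

```shell
# AverageValue target: desired = ceil(total_metric / target_per_pod)
total_rps=4500    # total requests/second across all pods (illustrative)
target_rps=1000   # averageValue from the HPA above

desired=$(( (total_rps + target_rps - 1) / target_rps ))
echo "desired: $desired"   # ceil(4500 / 1000) = 5
```

This differs from a Utilization target: it does not depend on resource requests, only on the raw metric value per pod.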
External Metrics
You can even use metrics from external systems:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: queue_messages_ready
        selector:
          matchLabels:
            queue: worker_tasks
      target:
        type: AverageValue
        averageValue: 30
HPA Behavior Configuration
Since Kubernetes 1.18, the behavior field lets you fine-tune how aggressively HPA scales up or down:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
This configuration:
- Allows rapid scaling up (100% increase every 15 seconds)
- Implements conservative scaling down (10% decrease every 60 seconds)
- Uses a 5-minute stabilization window before scaling down to prevent "flapping"
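A Percent policy caps how much the replica count can change within each period. For instance, a 100% scale-up policy means the deployment can at most double per period; with sustained load it still reaches maxReplicas quickly. A rough sketch (starting replica count is illustrative):

```shell
# scaleUp policy: Percent=100 per 15s period -> replicas may at most double each period.
replicas=4
max_replicas=10

for period in 1 2; do
  allowed=$(( replicas * 2 ))          # current replicas + 100% of current replicas
  if [ "$allowed" -gt "$max_replicas" ]; then
    replicas=$max_replicas             # maxReplicas is always an upper bound
  else
    replicas=$allowed
  fi
  echo "after period $period: $replicas replicas"
done
# after period 1: 8 replicas
# after period 2: 10 replicas
```

The 10% scale-down policy works the same way in reverse, so a large fleet shrinks gradually rather than collapsing the moment load drops.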
Real-World Example: Web Application with HPA
Let's implement HPA for a web application that needs to handle varying traffic levels.
1. Define the Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp
        image: mywebapp:1.0
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: webapp-service
spec:
  selector:
    app: webapp
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP
2. Configure HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
3. Testing the Autoscaler
To test your HPA, you can generate artificial load:
# Create a pod to generate load
kubectl run -i --tty load-generator --rm --image=busybox --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://webapp-service; done"
Watch the HPA respond:
kubectl get hpa webapp-hpa --watch
You should see the number of replicas increase as CPU utilization rises.
Common Issues and Troubleshooting
Metrics Not Available
If HPA shows "unknown" metrics:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
my-app-hpa Deployment/my-app <unknown>/50% 1 10 1 5m
Check that:
- The metrics server is installed and running
- Pods have resource requests defined
- The metrics API is accessible:
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods"
Scaling Not Working As Expected
If your HPA doesn't scale as expected:
- Check current resource utilization:
  kubectl top pods
- Verify that your target utilization is appropriate
- Check HPA events for clues:
  kubectl describe hpa my-app-hpa
Best Practices
- Set Appropriate Resource Requests:
  - The accuracy of HPA depends on correctly defined resource requests
  - Set reasonable values based on application needs
- Choose Target Utilization Wisely:
  - Lower values (like 50%) provide more headroom but use more resources
  - Higher values (like 80%) are more efficient but provide less buffer
- Configure Scaling Behavior:
  - Tune scaling policies based on your application's characteristics
  - Use longer stabilization windows for applications that don't handle scaling well
- Ensure Services Have Readiness Probes:
  - Prevents traffic from being sent to pods that aren't ready yet
- Use Pod Disruption Budgets:
  - Ensures high availability during scale-down operations

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: webapp-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: webapp
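The readiness-probe recommendation above could look like the following container snippet; the /healthz path and the timing values are illustrative assumptions, not part of the webapp manifest earlier in this guide:

```yaml
containers:
- name: webapp
  image: mywebapp:1.0
  readinessProbe:
    httpGet:
      path: /healthz   # hypothetical health-check endpoint
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 10
```

With this in place, newly scaled-up pods receive Service traffic only after the probe succeeds, so a scaling event does not route requests to pods that are still starting.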
Summary
The HorizontalPodAutoscaler is a powerful Kubernetes resource that helps:
- Automatically adjust application scale based on workload
- Optimize resource utilization
- Ensure application performance under varying loads
- Reduce operational overhead
By properly configuring HPA, you can ensure your applications have the right amount of resources at the right time, saving costs during low-traffic periods and maintaining performance during high-traffic periods.
Exercises
- Set up an HPA for a simple web application that scales based on CPU usage
- Configure an HPA that uses both CPU and memory metrics
- Experiment with different behavior configurations and observe how they affect scaling operations
- Create an HPA that scales down very conservatively but scales up quickly
- Use a load testing tool to test your HPA configuration under realistic conditions