### A practical guide with real code examples

Autoscaling in Kubernetes is often described as something almost magical: _add load → Kubernetes scales everything automatically_.
In reality, autoscaling is a **set of separate controllers**, each solving a very specific problem. When these mechanisms are misunderstood, autoscaling becomes unstable or even dangerous. When they are understood, it becomes predictable and extremely powerful.
This article explains **how autoscaling actually works**, how the different autoscalers interact, and where their real limits are — with concrete YAML examples.
---
## What “autoscaler” really means in Kubernetes
There is no single autoscaler in Kubernetes. Instead, there are **three independent layers**, each operating at a different level of the system:
- **Horizontal Pod Autoscaler (HPA)** – scales the number of pods
- **Vertical Pod Autoscaler (VPA)** – scales CPU and memory requests of pods
- **Cluster Autoscaler (CA)** – scales the number of nodes
Each one answers a different question, and none of them replaces the others.
---
## Horizontal Pod Autoscaler (HPA)
HPA is the most commonly used autoscaler.
Its responsibility is simple: **increase or decrease the number of pod replicas** based on metrics.
The most important thing to understand is that HPA **never creates nodes**. It only adjusts the `replicas` field of a workload.
---
### How HPA works internally
HPA periodically:
1. Reads metrics (CPU, memory, or custom metrics)
2. Compares them to the target
3. Updates the replica count of a Deployment / StatefulSet / ReplicaSet
If pods cannot be scheduled because there is no capacity, HPA will still increase replicas — but those pods will remain `Pending`.
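The loop above reduces to one formula, documented in the Kubernetes HPA docs: the desired replica count scales proportionally to how far the observed metric is from the target. A minimal sketch in Python (pod counts and percentages are illustrative; the real controller also applies a tolerance band and the min/max replica bounds):

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float) -> int:
    """HPA's core formula: scale replicas proportionally to the
    ratio of observed metric to target metric."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

# 5 pods averaging 90% CPU against a 70% target -> scale up to 7
print(desired_replicas(5, 90, 70))  # 7
# 5 pods averaging 40% -> scale down to 3
print(desired_replicas(5, 40, 70))  # 3
```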
---
### HPA example (CPU-based)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
What this means in practice:
- Kubernetes looks at **average CPU usage across all pods**
- If it goes above 70%, replicas increase
- If it goes below the threshold, replicas decrease
---
### A critical prerequisite: resource requests
HPA relies on **CPU requests**, not limits.
```yaml
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
```
If requests are missing or unrealistic, HPA calculations become meaningless and scaling becomes unstable.
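To make that dependency concrete: utilization is observed usage divided by the pod's request. A small Python sketch (all millicore values are illustrative) shows why an unrealistic request distorts the signal HPA acts on:

```python
def cpu_utilization_percent(usage_millicores: float,
                            request_millicores: float) -> float:
    """HPA utilization is usage divided by the pod's *request*, not its limit."""
    return usage_millicores / request_millicores * 100

# A pod using 150m against a 200m request reads as 75% utilization...
print(cpu_utilization_percent(150, 200))  # 75.0
# ...but the same usage against an unrealistic 50m request reads as 300%,
# driving aggressive, noisy scale-ups.
print(cpu_utilization_percent(150, 50))   # 300.0
```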
---
## Vertical Pod Autoscaler (VPA)
VPA solves a different problem: **incorrect resource sizing**.
Instead of asking _“How many pods do we need?”_, VPA asks:
> _How much CPU and memory does each pod actually need?_
---
### How VPA is implemented
VPA consists of three components:
- **Recommender** – analyzes historical resource usage
- **Updater** – evicts pods when resources must change
- **Admission Controller** – injects new requests into pods at creation time
Because Kubernetes does not allow changing `requests` on a running pod, **VPA applies changes only by recreating pods**.
---
### VPA example
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi
```
In `Auto` mode:
- pods may be evicted
- new pods start with updated requests
- scheduler makes better placement decisions
In production, VPA is often used in `Off` or `Initial` mode to avoid unexpected evictions.
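The effect of `minAllowed` / `maxAllowed` is simply a clamp on the Recommender's output. A sketch in Python, using the CPU bounds from the example above (the recommendation values themselves are made up):

```python
def clamp_recommendation(recommended_cpu_m: int,
                         min_cpu_m: int = 100,    # minAllowed: 100m
                         max_cpu_m: int = 2000) -> int:  # maxAllowed: 2 cores
    """VPA bounds the recommended request by the containerPolicies range."""
    return max(min_cpu_m, min(recommended_cpu_m, max_cpu_m))

print(clamp_recommendation(60))    # 100 -> raised to minAllowed
print(clamp_recommendation(850))   # 850 -> within bounds, kept as-is
print(clamp_recommendation(3500))  # 2000 -> capped at maxAllowed
```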
---
### HPA and VPA together: a warning
HPA (CPU-based) and VPA **should not manage the same resource dimension** at the same time.
Why:
- HPA reacts to CPU usage relative to requests
- VPA changes CPU requests
- this creates feedback loops
A common pattern is:
- HPA for scaling replicas
- VPA only for **memory** or in recommendation-only mode
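The feedback loop is easy to reproduce in a toy simulation: hold total load fixed, let HPA chase a 70% target, and let a naive VPA reset the request to observed usage each round. Because utilization then always reads ~100%, HPA scales up indefinitely until it hits its replica cap. This is a deliberately simplified model, not controller code; all numbers are illustrative:

```python
import math

TARGET = 70          # HPA target utilization (%)
MAX_REPLICAS = 50    # HPA maxReplicas
total_load_m = 1000  # fixed aggregate CPU demand, millicores

replicas, request_m = 2, 200.0
for _ in range(10):
    usage_per_pod = total_load_m / replicas
    # Naive VPA step: set the request to whatever the pod is using now
    request_m = usage_per_pod
    # HPA step: utilization is now ~100%, permanently above the 70% target
    utilization = usage_per_pod / request_m * 100
    replicas = min(MAX_REPLICAS, math.ceil(replicas * utilization / TARGET))

print(replicas)  # 50 -- pinned at maxReplicas, never converging
```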
---
## Cluster Autoscaler (CA)
Cluster Autoscaler works at the **node level**.
It answers the question HPA cannot:
> _Do we need more machines to run these pods?_
---
### How Cluster Autoscaler works
Cluster Autoscaler watches for:
- pods stuck in `Pending` due to insufficient CPU or memory
- nodes that are underutilized and safe to remove
It does **not** look at CPU usage directly.
It reacts to **scheduler failures**, not metrics.
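The trigger condition can be sketched in a few lines of Python. The dictionaries below are simplified stand-ins for the real Pod API objects, not the actual Cluster Autoscaler code:

```python
def needs_scale_up(pods: list[dict]) -> bool:
    """Cluster Autoscaler keys off pods the scheduler marked Unschedulable,
    not off node CPU graphs."""
    return any(
        p["phase"] == "Pending" and p.get("reason") == "Unschedulable"
        for p in pods
    )

pods = [
    {"name": "web-1", "phase": "Running"},
    {"name": "web-7", "phase": "Pending", "reason": "Unschedulable"},
]
print(needs_scale_up(pods))  # True -> a scale-up would be considered
```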
---
### Real autoscaling chain
In a healthy setup, scaling usually happens like this:
```
Traffic increases
→ HPA increases replicas
→ scheduler cannot place pods
→ Cluster Autoscaler adds nodes
→ pods become Running
```
---
### Cluster Autoscaler configuration (example)
On cloud providers, CA is usually deployed as a Deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --nodes=1:10:nodegroup-1
```
The autoscaler interacts directly with the cloud API to add or remove nodes.
---
## What autoscaling does NOT solve
Autoscaling does not:
- fix slow code
- fix memory leaks
- fix blocking I/O
- fix bad readiness probes
- fix stateful bottlenecks
Autoscaling **amplifies architecture**, both good and bad.
---
## Common production mistakes
One of the most frequent mistakes is enabling HPA without proper resource requests.
Another is expecting instant scaling — Cluster Autoscaler reacts in minutes, not seconds.
Autoscaling is **reactive**, not predictive.
---
## A mental model that actually works
Autoscaling becomes simple if you keep the responsibilities clear:
- **HPA** decides _how many pods_
- **VPA** decides _how big each pod should be_
- **Cluster Autoscaler** decides _how many nodes are needed_
Each autoscaler solves one problem and depends on the others to do their part.
---
## Final takeaway
Kubernetes autoscaling is powerful precisely because it is **modular**, not magical. When HPA, VPA, and Cluster Autoscaler are configured with realistic expectations and sane resource requests, scaling becomes predictable and boring — which is exactly what you want in production.