### A practical guide with real code examples

Autoscaling in Kubernetes is often described as something almost magical: _add load → Kubernetes scales everything automatically_.
In reality, autoscaling is a **set of separate controllers**, each solving a very specific problem. When these mechanisms are misunderstood, autoscaling becomes unstable or even dangerous. When they are understood, it becomes predictable and extremely powerful.
This article explains **how autoscaling actually works**, how the different autoscalers interact, and where their real limits are — with concrete YAML examples.
---
## What “autoscaler” really means in Kubernetes
There is no single autoscaler in Kubernetes. Instead, there are **three independent layers**, each operating at a different level of the system:
- **Horizontal Pod Autoscaler (HPA)** – scales the number of pods
- **Vertical Pod Autoscaler (VPA)** – scales CPU and memory requests of pods
- **Cluster Autoscaler (CA)** – scales the number of nodes
Each one answers a different question, and none of them replaces the others.
---
## Horizontal Pod Autoscaler (HPA)
HPA is the most commonly used autoscaler.
Its responsibility is simple: **increase or decrease the number of pod replicas** based on metrics.
The most important thing to understand is that HPA **never creates nodes**. It only adjusts the `replicas` field of a workload.
---
### How HPA works internally
HPA periodically:
1. Reads metrics (CPU, memory, or custom metrics)
2. Compares them to the target
3. Updates the replica count of a Deployment / StatefulSet / ReplicaSet
If pods cannot be scheduled because there is no capacity, HPA will still increase replicas — but those pods will remain `Pending`.
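The loop above reduces to one formula, documented in the Kubernetes HPA docs: the desired replica count scales proportionally to how far the observed metric is from the target. A minimal sketch in Python (pod counts and percentages are illustrative; the real controller also applies a tolerance band and the min/max replica bounds):

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float) -> int:
    """HPA's core formula: scale replicas proportionally to the
    ratio of observed metric to target metric."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

# 5 pods averaging 90% CPU against a 70% target -> scale up to 7
print(desired_replicas(5, 90, 70))  # 7
# 5 pods averaging 40% -> scale down to 3
print(desired_replicas(5, 40, 70))  # 3
```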
---
### HPA example (CPU-based)
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
What this means in practice:
- Kubernetes looks at **average CPU usage across all pods**
- If it goes above 70%, replicas increase
- If it goes below the threshold, replicas decrease
---
### A critical prerequisite: resource requests
HPA relies on **CPU requests**, not limits.
```yaml
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
```
If requests are missing or unrealistic, HPA calculations become meaningless and scaling becomes unstable.
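To make that dependency concrete: utilization is observed usage divided by the pod's request. A small Python sketch (all millicore values are illustrative) shows why an unrealistic request distorts the signal HPA acts on:

```python
def cpu_utilization_percent(usage_millicores: float,
                            request_millicores: float) -> float:
    """HPA utilization is usage divided by the pod's *request*, not its limit."""
    return usage_millicores / request_millicores * 100

# A pod using 150m against a 200m request reads as 75% utilization...
print(cpu_utilization_percent(150, 200))  # 75.0
# ...but the same usage against an unrealistic 50m request reads as 300%,
# driving aggressive, noisy scale-ups.
print(cpu_utilization_percent(150, 50))   # 300.0
```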
---
## Vertical Pod Autoscaler (VPA)
VPA solves a different problem: **incorrect resource sizing**.
Instead of asking _“How many pods do we need?”_, VPA asks:
> _How much CPU and memory does each pod actually need?_
---
### How VPA is implemented
VPA consists of three components:
- **Recommender** – analyzes historical resource usage
- **Updater** – evicts pods when resources must change
- **Admission Controller** – injects new requests into pods at creation time
Because Kubernetes does not allow changing `requests` on a running pod, **VPA applies changes only by recreating pods**.
---
### VPA example
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi
```
In `Auto` mode:
- pods may be evicted
- new pods start with updated requests
- scheduler makes better placement decisions
In production, VPA is often used in `Off` or `Initial` mode to avoid unexpected evictions.
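The effect of `minAllowed` / `maxAllowed` is simply a clamp on the Recommender's output. A sketch in Python, using the CPU bounds from the example above (the recommendation values themselves are made up):

```python
def clamp_recommendation(recommended_cpu_m: int,
                         min_cpu_m: int = 100,    # minAllowed: 100m
                         max_cpu_m: int = 2000) -> int:  # maxAllowed: 2 cores
    """VPA bounds the recommended request by the containerPolicies range."""
    return max(min_cpu_m, min(recommended_cpu_m, max_cpu_m))

print(clamp_recommendation(60))    # 100 -> raised to minAllowed
print(clamp_recommendation(850))   # 850 -> within bounds, kept as-is
print(clamp_recommendation(3500))  # 2000 -> capped at maxAllowed
```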
---
### HPA and VPA together: a warning
HPA (CPU-based) and VPA **should not manage the same resource dimension** at the same time.
Why:
- HPA reacts to CPU usage relative to requests
- VPA changes CPU requests
- this creates feedback loops
A common pattern is:
- HPA for scaling replicas
- VPA only for **memory** or in recommendation-only mode
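The feedback loop is easy to reproduce in a toy simulation: hold total load fixed, let HPA chase a 70% target, and let a naive VPA reset the request to observed usage each round. Because utilization then always reads ~100%, HPA scales up indefinitely until it hits its replica cap. This is a deliberately simplified model, not controller code; all numbers are illustrative:

```python
import math

TARGET = 70          # HPA target utilization (%)
MAX_REPLICAS = 50    # HPA maxReplicas
total_load_m = 1000  # fixed aggregate CPU demand, millicores

replicas, request_m = 2, 200.0
for _ in range(10):
    usage_per_pod = total_load_m / replicas
    # Naive VPA step: set the request to whatever the pod is using now
    request_m = usage_per_pod
    # HPA step: utilization is now ~100%, permanently above the 70% target
    utilization = usage_per_pod / request_m * 100
    replicas = min(MAX_REPLICAS, math.ceil(replicas * utilization / TARGET))

print(replicas)  # 50 -- pinned at maxReplicas, never converging
```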
---
## Cluster Autoscaler (CA)
Cluster Autoscaler works at the **node level**.
It answers the question HPA cannot:
> _Do we need more machines to run these pods?_
---
### How Cluster Autoscaler works
Cluster Autoscaler watches for:
- pods stuck in `Pending` due to insufficient CPU or memory
- nodes that are underutilized and safe to remove
It does **not** look at CPU usage directly.
It reacts to **scheduler failures**, not metrics.
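The trigger condition can be sketched in a few lines of Python. The dictionaries below are simplified stand-ins for the real Pod API objects, not the actual Cluster Autoscaler code:

```python
def needs_scale_up(pods: list[dict]) -> bool:
    """Cluster Autoscaler keys off pods the scheduler marked Unschedulable,
    not off node CPU graphs."""
    return any(
        p["phase"] == "Pending" and p.get("reason") == "Unschedulable"
        for p in pods
    )

pods = [
    {"name": "web-1", "phase": "Running"},
    {"name": "web-7", "phase": "Pending", "reason": "Unschedulable"},
]
print(needs_scale_up(pods))  # True -> a scale-up would be considered
```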
---
### Real autoscaling chain
In a healthy setup, scaling usually happens like this:
```
Traffic increases
→ HPA increases replicas
→ scheduler cannot place pods
→ Cluster Autoscaler adds nodes
→ pods become Running
```
---
### Cluster Autoscaler configuration (example)
On cloud providers, CA is usually deployed as a Deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --nodes=1:10:nodegroup-1
```
The autoscaler interacts directly with the cloud API to add or remove nodes.
---
## What autoscaling does NOT solve
Autoscaling does not:
- fix slow code
- fix memory leaks
- fix blocking I/O
- fix bad readiness probes
- fix stateful bottlenecks
Autoscaling **amplifies architecture**, both good and bad.
---
## Common production mistakes
One of the most frequent mistakes is enabling HPA without proper resource requests.
Another is expecting instant scaling — Cluster Autoscaler reacts in minutes, not seconds.
Autoscaling is **reactive**, not predictive.
---
## A mental model that actually works
Autoscaling becomes simple if you keep the responsibilities clear:
- **HPA** decides _how many pods_
- **VPA** decides _how big each pod should be_
- **Cluster Autoscaler** decides _how many nodes are needed_
Each autoscaler solves one problem and depends on the others to do their part.
---
## Final takeaway
Kubernetes autoscaling is powerful precisely because it is **modular**, not magical. When HPA, VPA, and Cluster Autoscaler are configured with realistic expectations and sane resource requests, scaling becomes predictable and boring — which is exactly what you want in production.