### The difference between a self-healing cluster and a CrashLoopBackOff spiral is three YAML fields most engineers configure wrong.
## 1. Introduction
Kubernetes is designed to be self-healing. When a container crashes, it restarts. When a node fails, Pods reschedule. When a deployment rolls out, traffic shifts gradually. But none of this intelligence is automatic — it depends entirely on Kubernetes knowing the truth about your application's health at every moment.
That truth comes from **probes**.
Without probes, Kubernetes operates blind. It assumes your application is healthy the moment the container process starts. It routes live traffic to Pods that are still warming up. It keeps running Pods that are deadlocked, out of memory, or silently serving 500 errors to every request. It marks rollouts as successful before a single request has been successfully handled.
With correctly configured probes, the picture changes entirely. Kubernetes knows when a slow-starting JVM application is actually ready to serve. It detects a deadlocked goroutine pool and restarts the container. It removes a Pod from the load balancer during a scheduled maintenance window and adds it back when the operation completes. It blocks a broken deployment from progressing until new Pods prove they can handle traffic.
This guide provides a complete mechanical understanding of Kubernetes health checks: the three probe types, the four check methods, every configuration parameter, the container startup sequence, the interaction with Service endpoints, and the production failure patterns that result from misconfiguration.
---
## 2. Overview of Kubernetes Health Checks
Kubernetes provides three distinct probe types, each serving a different role in the container lifecycle. Engineers who treat them as interchangeable — or who configure only one — are leaving significant reliability on the table.
### startupProbe
**Question answered:** "Has the application finished starting up?"
The `startupProbe` runs _before_ any other probe. While it is active, both `livenessProbe` and `readinessProbe` are suspended. Its sole purpose is to give slow-starting applications sufficient time to initialize without triggering premature liveness restarts.
Once `startupProbe` succeeds for the first time, it stops running entirely. It does not repeat.
### livenessProbe
**Question answered:** "Is the application still alive and worth keeping?"
The `livenessProbe` runs continuously throughout the container's lifetime. When it fails beyond the configured `failureThreshold`, **kubelet kills and restarts the container**. This is the mechanism behind Kubernetes self-healing.
Use it to detect states the application cannot recover from on its own: deadlocks, memory corruption, infinite loops, exhausted thread pools with no ability to drain.
### readinessProbe
**Question answered:** "Is the application ready to receive traffic right now?"
The `readinessProbe` also runs continuously. When it fails, **Kubernetes removes the Pod's IP from the Service endpoints**. Traffic stops reaching the Pod. The container is not restarted. When the probe passes again, the Pod is re-added to the endpoint list and traffic resumes.
Use it to signal temporary unavailability: cache warming, database connection establishment, downstream dependency degradation, or scheduled maintenance.
### Probe Role Summary
|Probe|Runs When|On Failure|On Success|Repeats|
|---|---|---|---|---|
|`startupProbe`|Container start, until first success|Container restarted|Probe stops, liveness/readiness begin|No (stops after first success)|
|`livenessProbe`|After startup succeeds, continuously|Container killed + restarted|No action|Yes, every `periodSeconds`|
|`readinessProbe`|After startup succeeds, continuously|Pod removed from Service endpoints|Pod added to Service endpoints|Yes, every `periodSeconds`|
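The division of labor above can be expressed in one container spec. The sketch below is illustrative, not prescriptive — the image, port, endpoint paths, and thresholds are assumptions to adapt:

```yaml
containers:
- name: app
  image: registry.example.com/app:v1   # placeholder image
  ports:
  - name: http
    containerPort: 8080
  startupProbe:                 # gates the other two probes
    httpGet: { path: /health/startup, port: http }
    failureThreshold: 30        # 30 × 10s = 300s startup budget
    periodSeconds: 10
  livenessProbe:                # failure → container restarted
    httpGet: { path: /health/live, port: http }
    periodSeconds: 20
    failureThreshold: 3
  readinessProbe:               # failure → removed from Service endpoints
    httpGet: { path: /health/ready, port: http }
    periodSeconds: 10
    failureThreshold: 3
```

The sections below unpack each field.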
---
## 3. How Kubernetes Executes Probes
Probes are executed by **kubelet** — the node agent running on every Kubernetes worker node. The API server does not run probes. The scheduler does not run probes. kubelet owns the entire probe lifecycle.
```
┌──────────────────────────────────────────────────────────┐
│ NODE │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ kubelet │ │
│ │ │ │
│ │ Probe scheduler (per container, per probe) │ │
│ │ │ │ │
│ │ ├── httpGet ──▶ HTTP request to container│ │
│ │ ├── tcpSocket ──▶ TCP dial to container │ │
│ │ ├── exec ──▶ command inside container │ │
│ │ └── grpc ──▶ gRPC health check │ │
│ │ │ │
│ │ Result: Success / Failure / Unknown │ │
│ │ │ │ │
│ │ ├── Update Pod status conditions │ │
│ │ ├── Report to API server │ │
│ │ └── Take action (restart / endpoint mgmt) │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Container Runtime (containerd / CRI-O) │ │
│ │ Runs exec probes, manages container processes │ │
│ └──────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
```
kubelet tracks three possible probe states:
- **Success** — the check passed
- **Failure** — the check failed (counts toward `failureThreshold`)
- **Unknown** — the check could not be completed; kubelet takes no immediate action and retries at the next `periodSeconds` interval
Probe results are reflected in the Pod's `status.conditions` and influence the `Ready` condition that controls Service endpoint membership.
---
## 4. Probe Types: Check Methods
### 4.1 HTTP Get (`httpGet`)
kubelet sends an HTTP GET request to the container. Any response with status code `200–399` is a success. Anything else — including network errors, timeouts, and 4xx/5xx responses — is a failure.
```yaml
livenessProbe:
httpGet:
path: /health/live
port: 8080
scheme: HTTP
httpHeaders:
- name: X-Health-Check
value: kubelet
```
**Best for:** Web services, REST APIs, any HTTP server. The most common probe type in production.
**Important:** kubelet sends the request from the node to the Pod's IP address. If your application binds only to `127.0.0.1`, the probe fails even though the endpoint works from inside the container — make sure the server listens on `0.0.0.0` (or the Pod IP).
---
### 4.2 TCP Socket (`tcpSocket`)
kubelet attempts to open a TCP connection to the specified port. If the connection is established, the probe succeeds. The connection is immediately closed — no data is sent or received.
```yaml
readinessProbe:
tcpSocket:
port: 5432
```
**Best for:** Databases, message brokers, and any service that speaks a binary protocol rather than HTTP. Use this for PostgreSQL, MySQL, Redis, Kafka, and similar workloads where an HTTP endpoint is not available.
**Limitation:** TCP success only means the port is open and accepting connections. It does not validate that the application is actually processing requests correctly.
---
### 4.3 Command Execution (`exec`)
kubelet executes a command inside the container. Exit code `0` is success. Any non-zero exit code is failure.
```yaml
livenessProbe:
exec:
command:
- /bin/sh
- -c
- "pg_isready -U postgres -h localhost"
```
**Best for:** Databases and legacy applications without HTTP endpoints, custom health logic that cannot be expressed as a network check, or verifying filesystem state (e.g., checking a PID file exists).
**Warning:** `exec` probes spawn a new process inside the container for every check. At low `periodSeconds` values (high check frequency) on CPU-constrained containers, this overhead accumulates. Do not use `exec` probes for high-frequency checks on resource-limited workloads.
---
### 4.4 gRPC (`grpc`)
Uses the standard [gRPC Health Checking Protocol](https://github.com/grpc/grpc/blob/master/doc/health-checking.md). kubelet calls the `grpc.health.v1.Health/Check` RPC. A `SERVING` response is success.
```yaml
livenessProbe:
grpc:
port: 50051
service: "myapp.Service"
```
**Best for:** gRPC-native microservices. Requires the application to implement the gRPC health protocol, which most major gRPC frameworks support natively.
**Note:** gRPC probes require Kubernetes 1.24+, where the `GRPCContainerProbe` feature gate is enabled by default; the feature became stable in 1.27.
---
### Probe Method Comparison
|Method|Protocol|Use Case|Validates App Logic|Overhead|
|---|---|---|---|---|
|`httpGet`|HTTP/HTTPS|Web services, REST APIs|Yes (if endpoint is meaningful)|Low|
|`tcpSocket`|TCP|Databases, binary protocols|No (port open only)|Very low|
|`exec`|Process exec|Legacy apps, custom checks|Yes (if command is meaningful)|Medium|
|`grpc`|gRPC|gRPC microservices|Yes|Low|
---
## 5. Probe Parameters and Configuration Options
### Core Parameters
|Parameter|Default|Description|
|---|---|---|
|`initialDelaySeconds`|`0`|Seconds to wait after container start before first probe.|
|`periodSeconds`|`10`|How often (in seconds) to run the probe.|
|`timeoutSeconds`|`1`|Seconds after which the probe times out. Counts as failure.|
|`successThreshold`|`1`|Minimum consecutive successes to consider probe passing. Must be `1` for liveness and startup probes.|
|`failureThreshold`|`3`|Consecutive failures before action is taken (restart or endpoint removal).|
|`terminationGracePeriodSeconds`|Pod-level default (`30`)|Probe-level override of the Pod's grace period when a liveness or startup probe fails (Kubernetes 1.22+).|
### How Thresholds Work Together
The total time before Kubernetes acts on a failing liveness probe:
```
initialDelaySeconds + (failureThreshold × periodSeconds)
```
Example with defaults (`initialDelaySeconds: 0`, `failureThreshold: 3`, `periodSeconds: 10`):
```
0 + (3 × 10) = 30 seconds before container restart
```
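As a sanity check, the same arithmetic in shell, using the default values from the table above:

```bash
# Worst-case seconds before kubelet acts on a continuously failing liveness probe.
initialDelaySeconds=0
failureThreshold=3
periodSeconds=10
echo $(( initialDelaySeconds + failureThreshold * periodSeconds ))  # prints 30
```

Plug in your own values before tuning: a probe with `failureThreshold: 3` and `periodSeconds: 30` tolerates up to 90 seconds of failure before a restart.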
### httpGet Fields
|Field|Required|Description|
|---|---|---|
|`path`|No (default: `/`)|URL path to request (e.g., `/health`)|
|`port`|Yes|Port number or named port|
|`scheme`|No (default: `HTTP`)|`HTTP` or `HTTPS`|
|`host`|No (default: Pod IP)|Override hostname for the request|
|`httpHeaders`|No|Custom headers as list of `{name, value}` pairs|
### tcpSocket Fields
|Field|Required|Description|
|---|---|---|
|`port`|Yes|Port number or named port to dial|
|`host`|No (default: Pod IP)|Override host address|
### exec Fields
|Field|Required|Description|
|---|---|---|
|`command`|Yes|Command and args as a string array. Not run in a shell — invoke `/bin/sh -c` explicitly if shell features are needed.|
### grpc Fields
|Field|Required|Description|
|---|---|---|
|`port`|Yes|Port number for gRPC server|
|`service`|No|Service name to pass to Health/Check RPC|
### Parameter Configuration Example
```yaml
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 15 # Wait 15s after container start
periodSeconds: 20 # Check every 20s
timeoutSeconds: 5 # Fail if no response in 5s
failureThreshold: 3 # Restart after 3 consecutive failures
successThreshold: 1 # 1 success to consider healthy
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
successThreshold: 2 # Require 2 consecutive successes before re-adding to endpoints
```
---
## 6. Startup Sequence of a Kubernetes Container
Understanding the order of operations is critical for correct probe configuration. Many production issues stem from engineers assuming all three probes run simultaneously from container start.
```
Container Created
│
▼
┌─────────────────────────────────────────────────────────────┐
│ PHASE 1: STARTUP │
│ │
│ startupProbe runs (if configured) │
│ livenessProbe ──── SUSPENDED │
│ readinessProbe ──── SUSPENDED │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ startupProbe polls every periodSeconds │ │
│ │ │ │
│ │ Failure × failureThreshold ──▶ Container RESTART │ │
│ │ First Success ──────────────▶ Phase 2 begins │ │
│ └─────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼ (startupProbe passes or not configured)
┌─────────────────────────────────────────────────────────────┐
│ PHASE 2: RUNNING (both probes active simultaneously) │
│ │
│ readinessProbe ──── polls every periodSeconds │
│ │ │
│ ├── Failure: Pod IP removed from Service endpoints │
│ │ Pod stays running, no restart │
│ └── Success: Pod IP added to Service endpoints │
│ Pod receives traffic │
│ │
│ livenessProbe ──── polls every periodSeconds │
│ │ │
│ ├── Failure × failureThreshold: Container RESTARTED │
│ └── Success: No action │
└─────────────────────────────────────────────────────────────┘
│
▼ (livenessProbe triggers restart)
Container Terminated → terminationGracePeriodSeconds → New Container
│
└──▶ Sequence repeats from Phase 1
```
### Key Timing Insight
Without `startupProbe`, a slow-starting application faces this dangerous window:
```
t=0 Container starts, JVM begins loading
t=10 livenessProbe fires (first check) → app not ready → FAILURE 1
t=20 livenessProbe fires → app still loading → FAILURE 2
t=30 livenessProbe fires → app still loading → FAILURE 3 → RESTART
t=0 Container restarts. Loop repeats forever. CrashLoopBackOff.
```
With `startupProbe` (`failureThreshold: 30`, `periodSeconds: 10` = 300s budget):
```
t=0 Container starts, JVM begins loading
t=10 startupProbe fires → not ready → failure 1 of 30 (no restart)
...
t=120 startupProbe fires → app ready → SUCCESS → startup complete
t=120+  livenessProbe and readinessProbe take over
```
---
## 7. Interaction with Kubernetes Networking
The `readinessProbe` has a direct and immediate effect on traffic routing through Kubernetes Services.
When a Service selects Pods by label, it maintains an `Endpoints` object (or `EndpointSlice` in modern clusters) listing the IP addresses of all Pods currently eligible to receive traffic. The **endpoint controller** watches Pod `Ready` conditions and updates this list continuously.
```
┌───────────────────────────────────────────────────────────────┐
│ TRAFFIC ROUTING FLOW │
│ │
│ Client Request │
│ │ │
│ ▼ │
│ ┌──────────┐ selector: app=myapp │
│ │ Service │ ──────────────────────────────────────────┐ │
│ └──────────┘ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐│
│ │ EndpointSlice ││
│ │ addresses: [10.0.0.1, 10.0.0.3] ← only Ready Pods ││
│ │ NOT included: 10.0.0.2 ← readinessProbe failing ││
│ └──────────────────────────────────────────────────────────┘│
│ │
│ Pod 10.0.0.1 Ready: true ← receives traffic │
│ Pod 10.0.0.2 Ready: false ← excluded from endpoints │
│ Pod 10.0.0.3 Ready: true ← receives traffic │
└───────────────────────────────────────────────────────────────┘
```
### During Rolling Updates
This mechanism is what makes zero-downtime rolling updates possible:
1. New Pod starts → `readinessProbe` not yet passing → Pod excluded from endpoints
2. New Pod passes readiness → added to endpoints → traffic begins routing to it
3. Old Pod is deleted → marked `Terminating`, removed from Service endpoints, and sent `SIGTERM`
4. Old Pod drains in-flight requests during `terminationGracePeriodSeconds`
5. Old Pod process exits
Because a new Pod only joins the endpoints after passing readiness (step 2), and old Pods keep serving in-flight requests while they drain (steps 3–4), traffic is always covered by at least one healthy Pod.
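This handoff is tuned through the Deployment's update strategy. A hedged sketch — the `maxSurge`/`maxUnavailable` values and the `preStop` sleep duration are illustrative choices, not requirements:

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # create one extra Pod before removing an old one
      maxUnavailable: 0    # never drop below the desired ready count
  template:
    spec:
      containers:
      - name: app
        lifecycle:
          preStop:
            exec:
              # brief pause so endpoint removal propagates before shutdown begins
              command: ["sleep", "5"]
```

With `maxUnavailable: 0`, the rollout cannot proceed until each new Pod passes its `readinessProbe`, which is exactly the safety property described above.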
### Readiness Gates
For advanced use cases, `ReadinessGates` allow external systems to contribute to a Pod's readiness condition. A service mesh or custom controller can set a condition that Kubernetes includes in the overall readiness evaluation — useful for ensuring a sidecar proxy is fully initialized before the Pod receives traffic.
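In the Pod spec this is a single field; the condition type below is a hypothetical name that an external controller (e.g., a service mesh) would set via the Pod status API:

```yaml
spec:
  readinessGates:
  - conditionType: "example.com/proxy-ready"   # hypothetical; set by an external controller
```

The Pod is reported `Ready` only when all container readiness probes pass *and* every listed condition is `True`.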
---
## 8. Common Production Problems
### Problem 1: CrashLoopBackOff from Aggressive livenessProbe
**Scenario:** A Spring Boot application with a 45-second startup time. `livenessProbe` configured with `initialDelaySeconds: 10`, `failureThreshold: 3`, `periodSeconds: 10`.
```
t=10 liveness fires → app loading → FAIL 1
t=20 liveness fires → app loading → FAIL 2
t=30 liveness fires → app loading → FAIL 3 → RESTART
t=0 Container restarts → repeat → CrashLoopBackOff
```
**Fix:** Add `startupProbe` with sufficient budget to cover worst-case startup time.
```yaml
startupProbe:
httpGet:
path: /actuator/health
port: 8080
failureThreshold: 30 # 30 × 10s = 5 minutes budget
periodSeconds: 10
```
---
### Problem 2: Traffic Sent to Unready Pods
**Scenario:** No `readinessProbe` configured. Rolling update deploys new Pods. New Pods are added to Service endpoints immediately on container start, before the application finishes initializing. Clients receive connection refused or 503 errors for 20–40 seconds.
**Fix:** Always configure `readinessProbe`. Separate it from `livenessProbe` — use a dedicated `/health/ready` endpoint that checks downstream dependencies.
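A minimal version of the fix, assuming the application exposes a `/health/ready` endpoint on port 8080 (both are assumptions to adapt):

```yaml
readinessProbe:
  httpGet:
    path: /health/ready   # dedicated readiness endpoint, separate from liveness
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
```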
---
### Problem 3: readinessProbe Takes Down Healthy Pods
**Scenario:** `readinessProbe` calls an endpoint that checks a third-party payment API. The payment API has a 5-minute outage. All Pods fail readiness and are removed from endpoints. The application is completely unavailable even though it could serve non-payment requests.
**Fix:** Readiness probes should check **local** application health, not external dependency health. External dependency checks belong in application-level circuit breakers, not Kubernetes probes.
---
### Problem 4: Network Delays Causing Probe Flapping
**Scenario:** `timeoutSeconds: 1` on a probe calling an endpoint that occasionally takes 1.2 seconds under load. Probes intermittently fail and succeed, causing Pods to flap in and out of Service endpoints. Clients experience intermittent errors.
**Fix:** Set `timeoutSeconds` to a realistic value based on observed p99 response times at peak load. Use `successThreshold: 2` on `readinessProbe` to require consistent success before re-adding to endpoints.
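Applied to the scenario above (observed p99 around 1.2 seconds), a conservative configuration might look like this sketch — the endpoint and port are assumed:

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  timeoutSeconds: 3       # > 2× observed p99 (1.2s), with headroom
  periodSeconds: 10
  failureThreshold: 3     # tolerate brief blips before endpoint removal
  successThreshold: 2     # require stable recovery before re-adding
```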
---
### Problem 5: Misconfigured Port in Probe
**Scenario:** Application listens on port `8080` for application traffic, port `9090` for metrics and health. `livenessProbe` configured with `port: 8080` pointing at the app port, but the `/health` path is only served on `9090`.
**Symptom:** Probe always returns 404. Container restarts continuously.
**Fix:** Always verify probe port and path match exactly. Use named ports to avoid numeric port confusion:
```yaml
ports:
- name: http
containerPort: 8080
- name: health
containerPort: 9090
livenessProbe:
httpGet:
path: /health/live
port: health # uses named port, less error-prone
```
---
## 9. Debugging Health Checks
### Identify Probe Failures
```bash
# Show probe configuration and recent events
kubectl describe pod <pod-name>
# Look for in Events section:
# Warning Unhealthy Liveness probe failed: HTTP probe failed with statuscode: 503
# Warning Unhealthy Readiness probe failed: dial tcp: connection refused
```
### Check Pod Conditions
```bash
kubectl get pod <pod-name> -o jsonpath='{.status.conditions}' | jq .
# Look for:
# type: Ready, status: "False", reason: ContainersNotReady
# type: ContainersReady, status: "False"
```
### Watch Events in Real Time
```bash
kubectl get events --sort-by='.metadata.creationTimestamp' -n <namespace> -w
# Filter for probe-related:
kubectl get events -n <namespace> | grep -i "unhealthy\|probe"
```
### Check Container Logs Around Restart Time
```bash
# Logs from the previous (crashed) container instance
kubectl logs <pod-name> --previous
# Logs from specific container in multi-container pod
kubectl logs <pod-name> -c <container-name> --previous
```
### Verify Endpoint Membership
```bash
# Check if pod is in service endpoints
kubectl get endpoints <service-name>
kubectl describe endpoints <service-name>
# Check EndpointSlices (modern clusters)
kubectl get endpointslices -l kubernetes.io/service-name=<service-name>
```
### Monitor Rollout Health
```bash
kubectl rollout status deployment/<name>
# "Waiting for deployment rollout to finish: 1 out of 3 new replicas have been updated"
# If stuck: new pods are failing readiness
kubectl get rs -l app=<name>
# If new RS shows READY < DESIRED, readiness probe is blocking progression
```
### Manual Probe Testing
Test your probe endpoints directly from inside the cluster to rule out networking issues:
```bash
# Exec into a pod and test the health endpoint manually
kubectl exec -it <pod-name> -- curl -v http://localhost:8080/health/ready
# Or use a debug pod
kubectl run debug --image=curlimages/curl -it --rm -- \
curl http://<pod-ip>:8080/health/ready
```
---
## 10. Best Practices for Production Systems
### Use Three Separate Endpoints
Do not route all probes to the same URL. Each probe has a different semantic purpose and should query different aspects of application state:
```
GET /health/startup → Is initialization complete?
GET /health/live → Is the process alive and not deadlocked?
GET /health/ready → Is the app ready to handle requests?
```
### Always Use startupProbe for JVM, Python, and Large Runtimes
JVM warmup, Python import chains, and applications loading large ML models all have startup times incompatible with aggressive liveness timeouts. A 300-second startup budget via `startupProbe` with `failureThreshold: 30, periodSeconds: 10` is safer and more explicit than inflating `initialDelaySeconds` on `livenessProbe`.
### Set timeoutSeconds Based on Observed Latency
`timeoutSeconds: 1` (the default) is dangerously low for any endpoint that touches a database or makes a downstream call. Measure your health endpoint's p99 latency under load and set `timeoutSeconds` to at least 2× that value.
### Readiness ≠ Liveness
Never use the same logic for both probes. A Pod that is temporarily not ready (warming a cache, holding a maintenance-mode flag) should not be restarted. A Pod that has not been ready for 10 minutes probably should be restarted. These are different conditions requiring different probes.
### PodDisruptionBudget + readinessProbe Together
A `PodDisruptionBudget` limits how many Pods can be voluntarily evicted at once (e.g., during node drains), while the `readinessProbe` removes degraded Pods from traffic. Used together, they keep serving capacity above your floor during both planned disruptions and organic failures.
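A minimal PDB sketch for this pattern — the name, selector label, and `minAvailable` value are illustrative:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2          # node drains must leave at least 2 Ready Pods
  selector:
    matchLabels:
      app: myapp           # must match the Deployment's Pod labels
```

Note that the PDB counts *Ready* Pods, so a flapping `readinessProbe` can block a node drain entirely — another reason to tune probe thresholds against observed latency.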
### Monitor Probe Metrics
`kube-state-metrics` exposes `kube_pod_container_status_ready` and probe-related metrics. Set alerts for:
|Metric / Condition|Alert Threshold|
|---|---|
|Pod `Ready: false` for > 5 minutes|Immediate page|
|Container restart count increase|> 2 restarts in 10 minutes|
|Probe failure events in namespace|> 5 per minute|
|Endpoint count drops below minimum|Below `minAvailable` in PDB|
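With Prometheus scraping kube-state-metrics, the first two rows translate to alerting rules roughly like the sketch below — the group name, labels, and thresholds are the table's values and should be adapted to your environment:

```yaml
groups:
- name: probe-health
  rules:
  - alert: PodNotReady
    expr: kube_pod_container_status_ready == 0
    for: 5m                        # "Ready: false for > 5 minutes"
    labels: { severity: page }
  - alert: ContainerRestarting
    expr: increase(kube_pod_container_status_restarts_total[10m]) > 2
    labels: { severity: warning }
```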
---
## 11. Real Configuration Examples
### Web Service (REST API)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-service
spec:
template:
spec:
containers:
- name: api
image: registry.example.com/api:v1.2.0
ports:
- name: http
containerPort: 8080
- name: management
containerPort: 9090
startupProbe:
httpGet:
path: /actuator/health/liveness
port: management
failureThreshold: 20
periodSeconds: 5 # 100 second startup budget
livenessProbe:
httpGet:
path: /actuator/health/liveness
port: management
initialDelaySeconds: 0 # startupProbe handles delay
periodSeconds: 20
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /actuator/health/readiness
port: management
initialDelaySeconds: 0
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
successThreshold: 2
```
---
### PostgreSQL Database
```yaml
containers:
- name: postgres
image: postgres:15
ports:
- containerPort: 5432
startupProbe:
exec:
command:
- /bin/sh
- -c
- "pg_isready -U $POSTGRES_USER -d $POSTGRES_DB"
failureThreshold: 30
periodSeconds: 5 # 150 second startup budget
livenessProbe:
exec:
command:
- /bin/sh
- -c
- "pg_isready -U $POSTGRES_USER"
periodSeconds: 30
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
exec:
command:
- /bin/sh
- -c
- "pg_isready -U $POSTGRES_USER -d $POSTGRES_DB"
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
successThreshold: 1
```
---
### gRPC Microservice
```yaml
containers:
- name: grpc-service
image: registry.example.com/grpc-service:v2.0.0
ports:
- name: grpc
containerPort: 50051
startupProbe:
grpc:
port: 50051
      service: ""                  # empty string = overall server health
failureThreshold: 15
periodSeconds: 5
livenessProbe:
grpc:
port: 50051
      service: ""                  # empty string = overall server health
periodSeconds: 20
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
grpc:
port: 50051
service: "myapp.PaymentService" # service-specific check
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
successThreshold: 1
```
---
### Slow-Starting Application (ML Model Server)
```yaml
containers:
- name: model-server
image: registry.example.com/model-server:v1.0.0
ports:
- containerPort: 8501
startupProbe:
httpGet:
path: /v1/models/mymodel # model must be loaded
port: 8501
failureThreshold: 60
periodSeconds: 10 # 600 second (10 min) startup budget
timeoutSeconds: 10
livenessProbe:
tcpSocket:
port: 8501 # just verify process is alive
periodSeconds: 30
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
      path: /v1/models/mymodel   # model status; the :predict endpoint accepts only POST
port: 8501
periodSeconds: 15
timeoutSeconds: 10
failureThreshold: 2
successThreshold: 1
```
---
### Redis Cache
```yaml
containers:
- name: redis
image: redis:7-alpine
ports:
- containerPort: 6379
livenessProbe:
exec:
command:
- redis-cli
- ping
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
readinessProbe:
exec:
command:
- redis-cli
- ping
periodSeconds: 5
timeoutSeconds: 2
failureThreshold: 3
successThreshold: 1
```
---
## 12. Conclusion
Kubernetes health probes are not configuration boilerplate — they are the nervous system of your cluster's self-healing capability. Every meaningful Kubernetes behavior that engineers rely on in production — zero-downtime rollouts, automatic restarts, traffic routing, safe node drains — depends on probes being configured correctly.
The three probes serve three fundamentally different purposes. `startupProbe` buys slow applications the time they need to initialize without triggering false positive restarts. `livenessProbe` detects unrecoverable application states and triggers container restarts. `readinessProbe` dynamically manages traffic routing based on real-time application availability. Conflating their roles or omitting any of them creates reliability gaps that manifest as CrashLoopBackOff spirals, traffic sent to unready Pods, and deployment rollouts that block or cascade incorrectly.
The engineers who get this right share a common habit: they treat probe configuration as application-specific, not generic. They measure actual startup times and set budgets accordingly. They build dedicated health endpoints with clear semantics. They test probes under realistic load before deploying to production. They alert on probe failure rates and Pod readiness conditions as first-class signals.
Configure probes thoughtfully, and Kubernetes delivers on its self-healing promise. Treat them as an afterthought, and you will configure them in production at 3 AM under pressure — which is the worst possible time to learn how they work.
---
## Quick Reference: Probe Configuration Checklist
```
□ startupProbe configured for any app with startup time > 30s
□ livenessProbe checks process health only (no external dependencies)
□ readinessProbe checks local app readiness only (no external dependencies)
□ Separate health endpoints for liveness vs readiness
□ timeoutSeconds > p99 latency of health endpoint under load
□ failureThreshold × periodSeconds > acceptable flap window
□ successThreshold ≥ 2 on readinessProbe for stability
□ Named ports used in probe configuration
□ Probe endpoints tested manually before production deployment
□ Probe failure alerts configured in monitoring
```
---
_Article maintained at [doc.thedevops.dev](https://doc.thedevops.dev/) | Last updated: March 2026_