## Introduction Running Kubernetes in production without a testing strategy is an act of faith. You deploy workloads, apply configurations, and assume they'll behave. Sometimes they do. Then a node goes down at 3am, a deployment gets OOMKilled under load, or a container image ships with a critical CVE that nobody scanned. Testing Kubernetes is not a single discipline. It spans four distinct problem domains: resilience, performance, security, and resource efficiency. Each requires different tools, different mental models, and different workflows. Most teams pick one and ignore the rest. This article covers all four. We'll examine five tools — Chaos Mesh, k6, Trivy, kube-bench, and Goldilocks — in the context of a real DigitalOcean Kubernetes cluster provisioned with Terraform. For each tool we'll describe what it actually does, when to use it, what the output looks like, and how to interpret results. No installation walkthroughs. No Helm chart configuration deep dives. Just the tools themselves — what they measure, what they tell you, and what to do with the information. ## Source Code All Terraform files, Helm configurations, k6 test scripts, and Makefile from this article are available on GitHub: **[github.com/vladlevinas/Kubernetes-stresstest](https://github.com/vladlevinas/Kubernetes-stresstest)** The repository includes: - `main.tf` — cluster + all tools in one pass - `variables.tf` / `outputs.tf` - `terraform.tfvars.example` - `k6-test.yaml` — load test example - `Makefile` — shortcuts for every operation - `README.md` — quick start guide --- ## The Test Environment Before covering the tools, it's worth describing the infrastructure they run on. Understanding the cluster configuration helps interpret test results correctly — resource constraints on small nodes affect what's measurable and what's noise. ### What Terraform Provisions The Terraform configuration creates a complete Kubernetes testing platform on DigitalOcean in a single `terraform apply`. 
Here's what gets built: ``` ┌─────────────────────────────────────────────────────────────────┐ │ DigitalOcean DOKS — fra1 (Frankfurt) │ │ │ │ Control Plane (managed by DigitalOcean, no cost) │ │ ┌───────────────────────────────────────────────────────────┐ │ │ │ Kubernetes API Server │ etcd │ Scheduler │ CM │ │ │ └───────────────────────────────────────────────────────────┘ │ │ │ │ │ ┌─────────────┴──────────────┐ │ │ ▼ ▼ │ │ ┌─────────────────┐ ┌─────────────────┐ │ │ │ Node 1 │ │ Node 2 │ │ │ │ s-1vcpu-2gb │ │ s-1vcpu-2gb │ │ │ │ 1 vCPU │ │ 1 vCPU │ │ │ │ 2 GB RAM │ │ 2 GB RAM │ │ │ │ containerd │ │ containerd │ │ │ └────────┬────────┘ └────────┬────────┘ │ │ │ │ │ │ ┌────────▼──────────────────────────▼────────┐ │ │ │ Testing Stack │ │ │ │ │ │ │ │ chaos-mesh ns k6 ns trivy-system ns │ │ │ │ goldilocks ns default ns │ │ │ └────────────────────────────────────────────┘ │ │ │ │ NodePort services: :32333 (Chaos Mesh) :32080 (Goldilocks) │ │ Cost: ~$24/mo while running | $0 when destroyed │ └─────────────────────────────────────────────────────────────────┘ ``` ### Cluster Specifications |Property|Value| |---|---| |Provider|DigitalOcean DOKS| |Region|fra1 (Frankfurt)| |Node size|s-1vcpu-2gb| |Node count|2| |Kubernetes version|1.31.x| |Container runtime|containerd| |Control plane|Managed (DigitalOcean)| |HA control plane|No| |Total RAM|4 GB| |Total vCPU|2| |Monthly cost|~$24| ![[Pasted image 20260316185118.png]] ### What the Terraform Creates The configuration provisions these Kubernetes resources: ``` digitalocean_kubernetes_cluster — the cluster itself kubernetes_namespace (chaos-mesh) — isolated namespace helm_release (chaos-mesh) — Chaos Mesh operator + dashboard helm_release (k6-operator) — k6 operator with own namespace helm_release (trivy-operator) — Trivy with own namespace helm_release (goldilocks) — Goldilocks with own namespace kubernetes_namespace (goldilocks) — resource advisor namespace kubernetes_labels (default) — enables Goldilocks on default ns 
kubernetes_job (kube-bench) — one-shot CIS audit job
kubernetes_service_account — Chaos Mesh dashboard auth
kubernetes_cluster_role_binding — cluster-admin for dashboard
local_file (kubeconfig.yaml) — written to project directory
```

All five tools deploy independently. If one fails, the others continue. kube-bench runs immediately after cluster creation and exits — it's a one-shot job, not a long-running operator.

![[Pasted image 20260316185200.png]]

### Why This Node Size Matters for Testing

Two `s-1vcpu-2gb` nodes give 4 GB total RAM and 2 vCPUs for the entire cluster including system pods and all testing tools. This is intentional — it mirrors a constrained environment where resource pressure is real. Tests run against this backdrop:

- Chaos experiments that kill pods are meaningful because the cluster has limited headroom
- Load tests produce realistic throttling behavior
- Goldilocks recommendations reflect actual resource competition
- kube-bench results reflect a managed K8s environment with provider-controlled settings

If you test on a 16-core, 64GB cluster with nothing running, you're not testing — you're confirming that idle systems don't break.
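To make the moving parts concrete, the core of such a configuration can be sketched in a few resources. This is an illustration, not the repository's actual `main.tf`: the resource labels, cluster name, and exact version slug are assumptions, while the region, node size, and node count mirror the cluster table above.

```hcl
# Hypothetical sketch — attribute names follow the DigitalOcean and Helm
# Terraform providers; values mirror the cluster specs in this article.
resource "digitalocean_kubernetes_cluster" "test" {
  name    = "k8s-stress-test"   # illustrative name
  region  = "fra1"
  version = "1.31.1-do.0"       # any current 1.31.x DOKS version slug

  node_pool {
    name       = "default"
    size       = "s-1vcpu-2gb"
    node_count = 2
  }
}

resource "helm_release" "chaos_mesh" {
  name       = "chaos-mesh"
  repository = "https://charts.chaos-mesh.org"
  chart      = "chaos-mesh"
  namespace  = "chaos-mesh"
  # namespace is managed separately by a kubernetes_namespace resource
  create_namespace = false
}
```

Each remaining `helm_release` (k6-operator, trivy-operator, goldilocks) follows the same pattern, pointed at its respective chart repository.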
--- ## The Four Dimensions of Kubernetes Testing ``` ┌─────────────────────────┐ │ Kubernetes Testing │ └───────────┬─────────────┘ │ ┌─────────────────────┼─────────────────────┐ │ │ │ ▼ ▼ ▼ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ Resilience │ │ Performance │ │ Security │ │ │ │ │ │ │ │ Chaos Mesh │ │ k6 │ │ Trivy │ │ │ │ │ │ kube-bench │ └─────────────┘ └─────────────┘ └─────────────┘ │ ▼ ┌─────────────┐ │ Efficiency │ │ │ │ Goldilocks │ └─────────────┘ ``` Each dimension answers a different question: - **Resilience**: Does the system recover when things break? - **Performance**: Does the system hold up under real load? - **Security**: Are there known vulnerabilities or misconfigurations? - **Efficiency**: Are resources allocated correctly? Most teams test performance. Few test resilience. Almost none test all four together. --- ## Chaos Mesh — Resilience Testing Through Deliberate Failure ### What It Is Chaos Mesh is a cloud-native chaos engineering platform for Kubernetes. It injects faults at the infrastructure level — killing pods, corrupting network traffic, stressing CPU and memory, filling disks — and observes how the system responds. The core insight behind chaos engineering is that distributed systems fail in ways that are impossible to predict from code review or load testing alone. The only way to know how a system behaves under failure is to make it fail, intentionally, in a controlled way. ``` ┌────────────────────────────────────────────────────────┐ │ Chaos Mesh Architecture │ │ │ │ Dashboard (:32333) │ │ │ │ │ ▼ │ │ Chaos Controller Manager (Deployment) │ │ │ watches CRDs │ │ ▼ │ │ ┌──────────────────────────────────────────────┐ │ │ │ Chaos CRDs │ │ │ │ PodChaos │ NetworkChaos │ StressChaos │ ... 
│ │ │ └──────────────────────────────────────────────┘ │ │ │ instructs │ │ ▼ │ │ Chaos Daemon (DaemonSet — runs on every node) │ │ │ directly injects faults via │ │ ▼ │ │ Container Runtime (containerd) │ └────────────────────────────────────────────────────────┘ ``` ![[Pasted image 20260316185633.png|1200]] ### Fault Types Chaos Mesh supports eight categories of chaos: **PodChaos** — direct pod lifecycle manipulation: - `pod-kill`: terminates pods matching a selector - `pod-failure`: makes pods enter a failure state without killing them - `container-kill`: kills specific containers within a pod **NetworkChaos** — traffic manipulation at the network layer: - `delay`: adds configurable latency to all traffic in/out of a pod - `loss`: randomly drops a percentage of packets - `duplicate`: duplicates packets - `corrupt`: corrupts packet content - `partition`: completely cuts network between pods **StressChaos** — resource exhaustion: - CPU workers: spawns processes that consume CPU cycles - Memory workers: allocates and holds memory to trigger OOM conditions **HTTPChaos** — HTTP-level injection: - Delays, aborts, and request/response modifications at Layer 7 **TimeChaos** — clock skew injection for testing time-sensitive logic **IOChaos** — filesystem I/O fault injection: delays, errors, and attribute modifications **DNSChaos** — DNS resolution failures and random errors **KernelChaos** — kernel-level fault injection (requires privileged access) ![[Pasted image 20260316185812.png|1200]] ### Using Chaos Mesh Chaos Mesh has two interfaces: the dashboard and Kubernetes CRDs. Both create the same underlying objects. **Via dashboard (http://NODE-IP:32333):** The dashboard provides a visual workflow: choose fault type → configure selector → set mode → set duration → submit. The "Preview of Pods to be injected" section shows exactly which pods will be targeted before you commit. 
**Via kubectl and YAML:**

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-nginx
  namespace: chaos-mesh
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: nginx-test
  duration: "30s"
```

```bash
kubectl apply -f pod-chaos.yaml
kubectl get podchaos -n chaos-mesh
```

```
NAME             ACTION     DURATION   AGE
pod-kill-nginx   pod-kill   30s        12s
```

**Network delay experiment:**

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-100ms
  namespace: chaos-mesh
spec:
  action: delay
  mode: one
  selector:
    namespaces: [default]
    labelSelectors:
      app: nginx-test
  delay:
    latency: "100ms"
    jitter: "10ms"
  duration: "5m"
```

**CPU stress experiment:**

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-80pct
  namespace: chaos-mesh
spec:
  mode: one
  selector:
    namespaces: [default]
    labelSelectors:
      app: nginx-test
  stressors:
    cpu:
      workers: 1
      load: 80
  duration: "3m"
```

### Reading Results

During an experiment, watch what happens:

```bash
kubectl get pods -w --kubeconfig=kubeconfig.yaml
```

```
NAME                    READY   STATUS              RESTARTS   AGE
nginx-test-6ff8-slwrl   1/1     Running             0          5m
nginx-test-6ff8-slwrl   1/1     Terminating         0          5m
nginx-test-6ff8-abc12   0/1     Pending             0          0s
nginx-test-6ff8-abc12   0/1     ContainerCreating   0          1s
nginx-test-6ff8-abc12   1/1     Running             0          4s
```

This is the expected sequence for a healthy deployment. If the pod stays in `Terminating` or the replacement takes more than 30 seconds to reach `Running`, you have a problem — a missing or misconfigured readiness probe, slow image pulls, or a replica count too low to absorb the loss while the replacement starts.

The Chaos Mesh dashboard's Events tab shows a timeline of all injected faults with start/end times, which you can correlate against monitoring data.

### When to Use Chaos Mesh

Use Chaos Mesh when you need to answer these questions:

- Does my Deployment recover automatically when pods are killed?
- What happens to my service when a node disappears?
- How does my application respond to 200ms added latency on downstream services? - Does my circuit breaker actually trip under real network conditions? - What's my real RTO (Recovery Time Objective) for a pod failure? Do not use Chaos Mesh as a replacement for monitoring. It reveals failure modes; it does not tell you the business impact. Run Chaos Mesh alongside a load test to measure impact quantitatively. --- ## k6 — Performance Testing Under Real Load ### What It Is k6 is a developer-centric load testing tool. Scripts are written in JavaScript, tests run as Kubernetes Jobs (via the k6 Operator), and results come out as structured metrics. It's purpose-built for testing services that run inside Kubernetes — no external load generator needed, no egress costs, no network hops across datacenters. ``` ┌─────────────────────────────────────────────────────┐ │ k6 Operator Architecture │ │ │ │ ┌─────────────┐ │ │ │ TestRun │ (CRD you create) │ │ │ CRD │ │ │ └──────┬──────┘ │ │ │ operator watches │ │ ▼ │ │ ┌─────────────────┐ │ │ │ k6 Operator │ (Deployment) │ │ └──────┬──────────┘ │ │ │ creates │ │ ▼ │ │ ┌──────────────────────────────────────┐ │ │ │ k6 Jobs (parallelism: N) │ │ │ │ ┌────────┐ ┌────────┐ ┌────────┐ │ │ │ │ │ Job 1 │ │ Job 2 │ │ Job N │ │ │ │ │ │ VUs:10 │ │ VUs:10 │ │ VUs:10 │ │ │ │ │ └────────┘ └────────┘ └────────┘ │ │ │ └──────────────────────────────────────┘ │ │ │ HTTP traffic │ │ ▼ │ │ ┌─────────────────┐ │ │ │ Target Service │ (your workload) │ │ └─────────────────┘ │ └─────────────────────────────────────────────────────┘ ``` ### Test Structure A k6 test has three parts: options (what load to generate), default function (what to do per VU per iteration), and checks (assertions on responses). 
```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '30s', target: 10 }, // ramp up: 0 → 10 virtual users
    { duration: '1m', target: 10 },  // hold: 10 VUs for 1 minute
    { duration: '30s', target: 0 },  // ramp down: 10 → 0
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95% of requests under 500ms
    http_req_failed: ['rate<0.01'],   // error rate under 1%
  },
};

export default function () {
  const res = http.get('http://nginx-test.default.svc.cluster.local');
  check(res, {
    'status 200': (r) => r.status === 200,
    'duration < 200ms': (r) => r.timings.duration < 200,
  });
  sleep(1);
}
```

The `stages` array defines a load profile. Thresholds define pass/fail criteria — if either threshold is violated, the test exits with a non-zero code, which makes it CI-friendly.

### Running Tests

```bash
kubectl apply -f k6-test.yaml
kubectl get testrun -n k6 -w
```

```
NAME              STAGE      AGE
nginx-load-test   started    5s
nginx-load-test   running    12s
nginx-load-test   finished   2m18s
```

Collect results:

```bash
kubectl logs -n k6 -l k6_cr=nginx-load-test
```

```
✓ status 200
✓ duration < 200ms

checks.........................: 100.00% ✓ 620  ✗ 0
data_received..................: 524 kB  3.9 kB/s
data_sent......................: 52 kB   388 B/s
http_req_blocked...............: avg=122µs  min=2µs  med=4µs   max=12ms
http_req_duration..............: avg=12ms   min=3ms  med=10ms  max=89ms
  { expected_response:true }...: avg=12ms
http_req_failed................: 0.00%   ✓ 0    ✗ 620
http_reqs......................: 620     4.63/s
iteration_duration.............: avg=1.01s  min=1s   med=1s    max=1.09s
vus............................: 1       min=1  max=10
```

### Reading k6 Metrics

The most important metrics:

**`http_req_duration`** — end-to-end request latency. Look at `p(95)` and `p(99)`, not `avg`. Average latency hides tail latency that real users experience.

**`http_req_failed`** — percentage of requests that returned errors (status >= 400 or network errors). Even a 0.1% failure rate at scale is significant.
**`http_req_blocked`** — time waiting for a TCP connection slot. High values indicate connection pool exhaustion. **`iteration_duration`** — total time per VU iteration including sleep. Use this to calculate effective throughput. ### Combining k6 with Chaos The real power of k6 in a testing lab is running it simultaneously with Chaos Mesh experiments: ``` Timeline: 00:00 k6 load test starts — 10 VUs hitting nginx-test 01:00 Chaos Mesh injects pod-kill on nginx-test 01:04 Pod recovered — k6 shows error spike then recovery 02:00 Chaos Mesh injects 100ms network delay 02:30 k6 p(95) climbs from 12ms to 118ms 03:00 Network chaos ends — k6 metrics normalize 04:00 k6 test ends ``` This workflow answers the question that neither tool answers alone: not just "does the pod restart?" but "how many requests failed during the restart, and how long did recovery take?" ### When to Use k6 Use k6 when you need to answer: - What is my service's throughput at 10, 50, 100 concurrent users? - Where does latency start to degrade? - What's the actual error rate under load, not just under zero load? - Does my HPA (Horizontal Pod Autoscaler) trigger at the right time? - How does my service behave when a downstream dependency is slow? k6 is not a monitoring tool and not a synthetic uptime checker. It generates sustained load to find the breaking point. Run it in your test lab, not against production, unless you have very carefully scoped test scripts. --- ## Trivy Operator — Continuous Vulnerability Scanning ### What It Is Trivy is a vulnerability scanner. The Trivy Operator runs as a controller inside Kubernetes and automatically scans every container image deployed in the cluster. It doesn't require you to trigger scans — it watches for new workloads and scans them as they appear. Beyond container images, Trivy also scans Kubernetes resource configurations (ConfigMaps, Deployments, RBACs) against a set of known misconfiguration rules derived from NSA/CISA guidelines and CIS benchmarks. 
``` ┌────────────────────────────────────────────────────────┐ │ Trivy Operator Architecture │ │ │ │ Kubernetes API Server │ │ │ watches │ │ ▼ │ │ Trivy Operator (Deployment in trivy-system ns) │ │ │ │ │ ├── detects new Pod/ReplicaSet │ │ │ │ │ │ │ ▼ │ │ │ Scan Job (ephemeral) │ │ │ │ pulls image + scans │ │ │ ▼ │ │ │ VulnerabilityReport CRD │ │ │ │ │ └── detects new Deployment/ConfigMap/RBAC │ │ │ │ │ ▼ │ │ ConfigAuditReport CRD │ └────────────────────────────────────────────────────────┘ ``` ### Reading Vulnerability Reports ```bash kubectl get vulnerabilityreports -A ``` ``` NAMESPACE NAME CRITICAL HIGH MEDIUM LOW default replicaset-nginx-test-abc123 0 3 12 8 trivy-system replicaset-trivy-operator-xyz456 0 1 4 2 chaos-mesh daemonset-chaos-daemon-abc789 0 0 2 1 k6 deployment-k6-operator-def012 0 2 6 3 ``` Drill into a specific report: ```bash kubectl describe vulnerabilityreport replicaset-nginx-test-abc123 -n default ``` ``` Spec: Artifact: Digest: sha256:4bf0762cb... Repository: library/nginx Tag: latest Report: Vulnerabilities: - VulnerabilityID: CVE-2023-44487 Severity: HIGH Title: HTTP/2 Rapid Reset Attack InstalledVersion: 1.25.3 FixedVersion: 1.25.4 Description: ... 
- VulnerabilityID: CVE-2024-21626 Severity: HIGH Title: runc container breakout InstalledVersion: 1.1.9 FixedVersion: 1.1.12 ``` ### Reading Config Audit Reports ```bash kubectl get configauditreports -A ``` ``` NAMESPACE NAME CRITICAL HIGH MEDIUM LOW default replicaset-nginx-test-abc123 0 2 3 5 ``` ```bash kubectl describe configauditreport replicaset-nginx-test-abc123 -n default ``` ``` Report: Checks: - CheckID: KSV014 Severity: HIGH Title: Root file system is not read-only Message: Container 'nginx' of ReplicaSet 'nginx-test-abc123' should set 'securityContext.readOnlyRootFilesystem' to true - CheckID: KSV003 Severity: HIGH Title: No capabilities drop defined Message: Container 'nginx' should drop ALL capabilities ``` These findings are actionable: add `readOnlyRootFilesystem: true` and `capabilities: {drop: [ALL]}` to the container security context. ### Trivy Severity Levels |Severity|Meaning|Action| |---|---|---| |CRITICAL|Remote code execution, privilege escalation|Fix immediately — update image| |HIGH|Significant data exposure or system compromise|Fix within sprint| |MEDIUM|Limited impact, requires other conditions|Fix in regular maintenance| |LOW|Minimal impact or theoretical|Track and accept or fix| |UNKNOWN|Insufficient data to score|Investigate manually| The `ignoreUnfixed: true` setting (enabled in the Terraform config) filters out CVEs that have no available fix — these clutter reports without providing actionable guidance. ### When to Use Trivy Trivy runs continuously — you don't "use" it so much as read it periodically. 
Build it into your workflow:

- **Daily**: check for new CRITICAL/HIGH findings via `kubectl get vulnerabilityreports -A`
- **Before deploying new images**: add `trivy image <image>` to your CI pipeline
- **After major Kubernetes upgrades**: re-scan all workloads for new CVEs against updated components
- **After security incidents**: use ConfigAuditReports to check for misconfigurations that may have contributed

The key discipline with Trivy is not letting reports accumulate without action. A list of 200 unaddressed findings becomes noise. Triage weekly, fix CRITICAL findings same-day, and track HIGH findings in your issue tracker.

---

## kube-bench — CIS Security Benchmark Auditing

### What It Is

kube-bench runs the CIS (Center for Internet Security) Kubernetes Benchmark against your cluster. The CIS Benchmark is the industry standard for Kubernetes security hardening — 300+ checks across control plane, worker nodes, etcd, and policies.

> [!note]
> See an example of a full report: [Kubernetes-stresstest/kube-bench.sh at main · vladlevinas/Kubernetes-stresstest](https://github.com/vladlevinas/Kubernetes-stresstest/blob/main/kube-bench.sh)

Unlike Trivy, which scans workload content, kube-bench audits the cluster configuration itself: kubelet settings, API server flags, file permissions, authentication configuration, network policies, and RBAC setup.

On managed Kubernetes like DOKS, many checks are controlled by the provider and not configurable by the user. kube-bench correctly identifies these and marks them as warnings with remediation notes that say "this is controlled by your provider."

### Running kube-bench

In the Terraform setup, kube-bench runs as a Kubernetes Job immediately after cluster creation. It completes in about 60 seconds and exits.
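The Job itself is compact. Here is a condensed sketch of what such a manifest looks like, modeled on the upstream kube-bench example job — the Terraform-managed version in the repository may differ in image tag, targets, and mounted paths:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: kube-bench
spec:
  template:
    spec:
      hostPID: true  # lets the scanner inspect kubelet processes on the host
      containers:
        - name: kube-bench
          image: docker.io/aquasec/kube-bench:latest
          command: ["kube-bench", "run", "--targets", "node"]
          volumeMounts:
            - name: var-lib-kubelet
              mountPath: /var/lib/kubelet
              readOnly: true
            - name: etc-kubernetes
              mountPath: /etc/kubernetes
              readOnly: true
      restartPolicy: Never
      volumes:
        - name: var-lib-kubelet
          hostPath:
            path: /var/lib/kubelet
        - name: etc-kubernetes
          hostPath:
            path: /etc/kubernetes
```

The `hostPID: true` setting and the read-only hostPath mounts are what allow a pod-based scanner to audit node-level files and processes at all.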
```bash kubectl logs job/kube-bench --kubeconfig=kubeconfig.yaml ``` ``` [INFO] 4 Worker Node Security Configuration [INFO] 4.1 Worker Node Configuration Files [PASS] 4.1.1 Ensure that the kubelet service file permissions are set to 600 [PASS] 4.1.2 Ensure that the kubelet service file ownership is set to root:root [WARN] 4.1.3 If proxy kubeconfig file exists ensure permissions are set to 600 [PASS] 4.2.1 Ensure that the --anonymous-auth argument is set to false [PASS] 4.2.2 Ensure that the --authorization-mode argument is not set to AlwaysAllow [PASS] 4.2.6 Ensure that the --protect-kernel-defaults is set to true [FAIL] 4.2.11 Ensure that the RotateKubeletServerCertificate is set to true [INFO] 5 Kubernetes Policies [INFO] 5.1 RBAC and Service Accounts [WARN] 5.1.1 Ensure that the cluster-admin role is only used where required [FAIL] 5.1.6 Ensure that Service Account Tokens are not automatically mounted [PASS] 5.2.2 Minimize the admission of containers wishing to share the host PID [PASS] 5.2.3 Minimize the admission of containers with added capability [FAIL] 5.4.2 Ensure that all Namespaces have Network Policies defined == Remediations == 4.2.11 Edit the kubelet configuration file /var/lib/kubelet/config.yaml and set: RotateKubeletServerCertificate: true Note: On managed clusters (DOKS, GKE, EKS), this may be controlled by the provider. 5.1.6 Apply automountServiceAccountToken: false to service accounts that do not require API access. 5.4.2 Create NetworkPolicy objects for each namespace to restrict pod-to-pod traffic appropriately. == Summary node == 19 checks PASS 3 checks FAIL 4 checks WARN 0 checks INFO ``` ### Interpreting Results kube-bench output has four categories: **PASS** — configuration matches the CIS recommendation. No action needed. **FAIL** — configuration does not match. Remediation is described in the output. On managed Kubernetes, some FAILs are expected because the provider controls those settings (kubelet configuration, API server flags). 
**WARN** — check could not be fully automated or requires manual verification. The output explains what to check manually. **INFO** — informational finding, no action required. ### Actionable vs Non-Actionable Findings on DOKS On DigitalOcean managed Kubernetes, expect approximately: - 15-20 PASS on worker node checks - 2-4 FAIL on checks controlled by the provider (non-actionable) - 3-5 FAIL on policy checks (actionable — NetworkPolicies, RBAC, service account tokens) - Several WARN on configuration that requires manual inspection The actionable failures for a typical cluster are: |Check|Finding|Fix| |---|---|---| |5.1.6|Service Account Tokens auto-mounted|Add `automountServiceAccountToken: false` to service accounts| |5.4.2|No NetworkPolicies defined|Create default-deny NetworkPolicy per namespace| |5.1.1|cluster-admin used broadly|Audit ClusterRoleBindings, replace with least-privilege roles| |5.2.6|Containers running as root|Add `runAsNonRoot: true` to pod security context| ### When to Use kube-bench kube-bench is not a continuous monitoring tool — it's a point-in-time audit. Run it: - After initial cluster provisioning (already done in the Terraform setup) - After major Kubernetes version upgrades - Before production go-live or SOC2/ISO27001 audits - After significant RBAC or workload configuration changes - Quarterly as part of a security review cycle The output doubles as a hardening checklist. Work through the FAIL items one by one, distinguishing between provider-controlled (document and accept) and user-controlled (fix). --- ## Goldilocks — Resource Optimization via VPA Recommendations ### What It Is Goldilocks solves the resource request guessing problem. Most engineers set CPU and memory requests/limits based on intuition, copy-paste from documentation, or not at all. Both extremes are harmful: over-provisioned workloads waste money and reduce scheduling density; under-provisioned workloads get OOMKilled or CPU-throttled under load. 
Goldilocks runs the Kubernetes Vertical Pod Autoscaler (VPA) in recommendation-only mode on every workload in labeled namespaces, then presents those recommendations in a dashboard. The VPA watches actual CPU and memory usage over time and produces statistically sound recommendations based on real consumption patterns. ``` ┌─────────────────────────────────────────────────────────┐ │ Goldilocks Architecture │ │ │ │ ┌─────────────────────────────────────┐ │ │ │ Namespace (label: goldilocks=true) │ │ │ │ │ │ │ │ Deployment: nginx-test │ │ │ │ Pod: running, consuming resources │ │ │ └──────────────┬──────────────────────┘ │ │ │ metrics │ │ ▼ │ │ VPA (recommendation mode only — never resizes pods) │ │ │ recommendations │ │ ▼ │ │ Goldilocks Controller │ │ │ reads + aggregates │ │ ▼ │ │ Goldilocks Dashboard (:32080) │ │ │ displays per deployment │ │ ▼ │ │ ┌─────────────────────────────────┐ │ │ │ nginx-test │ │ │ │ CPU req: 15m lim: 15m │ │ │ │ Mem req: 32Mi lim: 32Mi │ │ │ └─────────────────────────────────┘ │ └─────────────────────────────────────────────────────────┘ ``` **Critical distinction:** VPA in recommendation mode never changes anything. It only watches and suggests. No pods are restarted, no resources are modified. This is the safe way to use VPA. ![[Pasted image 20260316193408.png|1200]] ### Reading Goldilocks Output Open `http://NODE-IP:32080`: ``` Namespace: default Deployment: nginx-test Container: nginx QoS Policy: Guaranteed (request == limit) Current settings: CPU req: - CPU lim: - Mem req: - Mem lim: - Recommended: CPU req: 15m CPU lim: 15m Mem req: 32Mi Mem lim: 32Mi Burstable policy: CPU req: 15m CPU lim: 1000m Mem req: 32Mi Mem lim: 500Mi ``` Goldilocks shows two recommendation modes: **Guaranteed** — request equals limit. The pod gets exactly what it asks for, no more. Best for predictable, steady-state workloads. The pod will never be CPU-throttled but will be OOMKilled if it spikes above the limit. 
**Burstable** — request is the minimum guarantee, limit is the ceiling. The pod can burst above its request when node resources are available. Best for workloads with variable load patterns.

### What the Metrics Mean

**CPU request** — the minimum CPU the scheduler guarantees. If you set 15m (millicores), the pod is guaranteed 1.5% of a vCPU.

**CPU limit** — the hard cap. CPU is throttled at this value even if the node has free capacity. Setting CPU limits too low causes CPU throttling — the application runs slowly without any visible error.

**Memory request** — the scheduler guarantee. Used for bin-packing decisions.

**Memory limit** — the hard cap. Exceeding this kills the pod with `OOMKilled`. Unlike CPU throttling, there's no graceful degradation — the process is terminated immediately.

### Workflow: Optimize Then Test

The correct workflow for Goldilocks is iterative:

```
1. Deploy workload with no resource limits
2. Run k6 load test (realistic traffic)
3. Wait 10-15 minutes for VPA to collect data
4. Read Goldilocks recommendations
5. Update deployment manifests with recommended values
6. Run k6 load test again
7. Verify: no OOMKills, no CPU throttling, no degraded latency
8. Repeat for each environment (test → staging → production)
```

Step 6 is often skipped. Don't skip it. Adding resource limits to a previously unconstrained pod can cause surprising performance regressions if the limits are tighter than the workload's actual burst behavior.

### When to Use Goldilocks

Use Goldilocks when:

- You're setting resource limits for the first time on a new workload
- You've been running workloads with no limits and want to add them safely
- You're experiencing OOMKills but don't know the right memory limit
- Nodes are running out of capacity and you suspect over-provisioning
- You're preparing cost optimization work — right-sized requests reduce waste

Don't trust Goldilocks recommendations on workloads that have only been running for a few minutes.
The VPA needs sustained traffic — ideally covering your peak load period — to produce accurate recommendations. Run your k6 load tests before reading Goldilocks output. --- ## Comparative Analysis ### Tool Comparison Table ||Chaos Mesh|k6|Trivy|kube-bench|Goldilocks| |---|---|---|---|---|---| |**Category**|Resilience|Performance|Security|Security|Efficiency| |**What it tests**|Failure recovery|Load capacity|Image CVEs + config|CIS compliance|Resource sizing| |**When it runs**|On demand|On demand|Continuously|One-shot|Continuously| |**Output type**|Pod events, dashboard|Metrics, pass/fail|VulnerabilityReports|Log output|Dashboard| |**Affects workloads?**|Yes (by design)|Yes (generates load)|No|No|No| |**Requires traffic**|No|Generates it|No|No|Yes| |**CI/CD friendly**|Partially|Yes|Yes|Yes|No| |**Dashboard**|Yes (:32333)|No|No|No|Yes (:32080)| |**External access**|NodePort|No|No|No|NodePort| ### Decision Guide: Which Tool for Which Problem ``` Problem: "We don't know if our pods recover after a crash" → Chaos Mesh (PodChaos, pod-kill action) Problem: "We don't know how many users our service can handle" → k6 (load test with ramp-up stages and thresholds) Problem: "Our container images might have unpatched CVEs" → Trivy Operator (scan all running images) Problem: "We need to pass a security audit" → kube-bench (CIS benchmark, document findings) Problem: "Our pods keep getting OOMKilled" → Goldilocks (read VPA memory recommendations) Problem: "We need to prove our service handles failure gracefully" → Chaos Mesh + k6 running simultaneously Problem: "We're over-spending on node capacity" → Goldilocks (identify over-provisioned workloads) Problem: "Security team wants CVE report for all running workloads" → kubectl get vulnerabilityreports -A -o json ``` ### Combining Tools for Maximum Coverage The tools are most powerful in combination. Three workflows cover most production readiness requirements: **Workflow 1: Pre-production resilience gate** 1. 
Deploy workload to test cluster 2. Run k6 at expected production load (baseline) 3. Run Chaos Mesh pod-kill during k6 test 4. Assert: error rate stays below 1%, recovery time under 10 seconds 5. Run Chaos Mesh network delay (100ms) during k6 test 6. Assert: p95 latency stays below SLA **Workflow 2: Security clearance before deployment** 1. Run `trivy image <new-image>` in CI — fail pipeline on CRITICAL 2. After deploy to test cluster, check `kubectl get vulnerabilityreports` 3. Run kube-bench if cluster configuration changed 4. Review ConfigAuditReports for new misconfigurations **Workflow 3: Resource right-sizing** 1. Deploy workload without resource limits 2. Run k6 at peak expected load for 10 minutes 3. Read Goldilocks dashboard 4. Apply recommended limits to deployment 5. Run k6 again — verify no regressions --- ## The Testing Loop All five tools together form a testing loop that covers the full lifecycle of a workload: ``` ┌─────────────────────────────────────────────────────────────────┐ │ Testing Loop │ │ │ │ ┌─────────┐ ┌──────────┐ ┌──────────┐ │ │ │ Deploy │────►│ k6 load │────►│Goldilocks│ │ │ │workload │ │ test │ │ sizing │ │ │ └─────────┘ └──────────┘ └────┬─────┘ │ │ ▲ │ │ │ │ apply limits │ recommendations │ │ └────────────────────────────────┘ │ │ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ Chaos │ │ Trivy │ │ kube- │ │ │ │ Mesh │ │ scan │ │ bench │ │ │ │resilience│ │ security │ │ audit │ │ │ └──────────┘ └──────────┘ └──────────┘ │ │ │ │ │ │ │ └────────────────┴────────────────┘ │ │ │ │ │ fix findings │ │ │ │ │ ▼ │ │ production ready │ └─────────────────────────────────────────────────────────────────┘ ``` No workload should reach production without passing through all five lenses. In practice, run Trivy and kube-bench first (they find blocking issues fastest), then size with Goldilocks, then validate resilience and performance with Chaos Mesh and k6. --- ## Conclusion Kubernetes testing is not one thing. 
A pod that survives chaos experiments might still fall over under load. A load-tested service might run with unpatched CVEs. A security-audited cluster might have workloads without resource limits, causing cascading evictions under traffic spikes.

The five tools in this stack cover the blind spots that individual approaches miss. More importantly, they're not expensive or complex to run — the entire stack deploys in under 10 minutes on a $24/month cluster and costs nothing once the cluster is destroyed. The infrastructure-as-code approach (single `terraform apply`, single `terraform destroy`) removes the friction that usually prevents teams from building test environments.

There's no excuse to skip testing when the environment is disposable.

---

_Full Terraform source code and setup guide: [doc.thedevops.dev](https://doc.thedevops.dev/)_

_Follow for more content on Kubernetes, AI infrastructure, and DevOps automation._