_How I used Claude AI and four MCP servers to build a production-realistic canary deployment pipeline — with weighted traffic splitting, automatic SSL, and human-controlled Pull Request approval — without writing a single manifest by hand._ --- ## Introduction ### The Question That Started It What if you could manage production Kubernetes infrastructure by just… talking to it? Not describing what you want and then copying YAML snippets into a terminal. Not asking an AI to generate a config file you then manually apply. A real conversation that results in actual Git commits, actual ArgoCD syncs, actual traffic shifting on a live cluster — with a human reviewing and approving every change before it touches production. That was the experiment. This article is what happened. ### What This Article Is NOT This is not a tutorial on how to install MCP. It is not a ChatGPT wrapper demo showing an AI generate a `hello-world` deployment. This is a real K3s cluster, real Cloudflare DNS, real Let's Encrypt TLS certificates, real Traefik weighted routing — and a real canary deployment that you can test yourself right now at **https://nginx.thedevops.dev**. ### The Three Rules I Set To keep the experiment honest, I gave myself three hard constraints: 1. **No manual `kubectl apply`** — every cluster change must come from a Git commit that ArgoCD syncs 2. **No direct editing of cluster resources** — if something is wrong, fix it in Git, not in the cluster 3. **Every infrastructure change must go through a Pull Request** — the AI proposes, the human approves and merges These constraints made the experiment harder. They also made it much more interesting — and much more representative of how a real team would actually work. 
--- ## Part 1 — The Idea & Problem Statement ### What I Wanted to Build The goal was simple to describe but complex to implement: > Deploy a production-realistic nginx web server with 3 replicas, then introduce a canary release — all managed through a conversational AI interface, with proper subdomains, SSL certificates, and human-controlled Pull Request approval before anything touches the cluster. Most canary deployment tutorials give you half the picture. They show you the Kubernetes manifests, or the ingress config, or the cert-manager setup — but rarely all of it together, end-to-end, in a way that reflects how a real team would actually work. I wanted to build the **full loop**: 1. A real web server — not a hello-world toy, but a proper nginx deployment with 3 replicas spread across nodes, resource limits, health probes, and a custom-designed frontend 2. A canary version — visually distinct, with version headers, deployed in full isolation in its own namespace 3. Weighted traffic splitting on a **single domain** — not two separate subdomains, but one URL that silently routes 90% of users to stable and 10% to canary 4. Automatic subdomain creation with real SSL certificates — `nginx.thedevops.dev` provisioned via Cloudflare DNS and secured with Let's Encrypt, without touching a browser 5. A proper GitOps change control flow — every infrastructure change proposed as a Git commit on a feature branch, reviewed as a Pull Request, and only applied to the cluster **after a human approves and merges** ### Why This Is Harder Than It Sounds The individual pieces are well-documented. The combination is not. 
- **Traefik v3** introduced stricter cross-namespace security defaults that silently break patterns from every v2 tutorial - **cert-manager** behaves differently with Traefik's `IngressRoute` CRD than with standard `Ingress` objects — the usual annotation approach simply does not work - **ArgoCD + Gitea** works beautifully in steady state, but the interaction between force-sync timing, PR merges, and automated sync policies requires careful sequencing - **Canary on a single domain** requires a `TraefikService` weighted load balancer, an ExternalName service as a namespace bridge, and coordinated IngressRoutes — none of which are covered together in a single place The AI didn't just generate YAML. It navigated these problems in real time — reading pod events, identifying mismatches, applying fixes through Git, and re-syncing — the same way a senior engineer would during an incident. --- ## Part 2 — The Stack ### Infrastructure Overview |Component|Version|Role| |---|---|---| |K3s|v1.32|Kubernetes distribution, 3 master nodes| |Traefik|v3.3.6|Ingress controller, bundled with K3s| |ArgoCD|latest|GitOps sync engine| |Gitea|latest|Self-hosted Git backend| |cert-manager|latest|Automatic TLS certificate provisioning| |Cloudflare|—|DNS management, zone `thedevops.dev`| The cluster runs 3 master nodes — no dedicated workers. Each node participates in both control plane and data plane duties, which is common for homelab and demo environments but also mirrors many small production setups. ### The AI Layer — What MCP Actually Is MCP (Model Context Protocol) is an open standard that lets AI models communicate with external tools and services through a structured interface. Think of it as a USB standard for AI integrations — instead of building a custom integration for every tool, you expose the tool through an MCP server and any compatible AI can use it. 
In this setup, Claude AI orchestrates four MCP servers simultaneously, each controlling a different layer of the infrastructure: |MCP Server|What it controls|URL| |---|---|---| |Kubernetes|kubectl operations, pod inspection, logs, events|internal| |ArgoCD|sync, health status, application management|https://argocd.thedevops.dev| |Gitea|file commits, branches, pull requests|https://git.thedevops.dev| |Cloudflare|DNS A records, TTL, proxy toggle|cloudflare.com| ### Full Architecture ``` You (plain English instruction) ↓ Claude AI ↓ ┌─────────┼──────────┬──────────┐ ↓ ↓ ↓ ↓ Cloudflare Gitea Kubernetes ArgoCD (DNS) (GitOps) (cluster) (sync) ↓ ↑ main branch ────────────┘ (source of truth) ``` The key insight is that Gitea is the **single source of truth**. The AI never applies resources directly to the cluster — it commits to Git, and ArgoCD reconciles the cluster state to match. This means every action is auditable, reversible, and human-reviewable. ### What the AI Can and Cannot Do **Can do:** - Create, update, delete files in Gitea - Read pod logs and events via kubectl - Create and delete Cloudflare DNS records - Force ArgoCD sync operations - Open Pull Requests for human review **Cannot do (by design):** - Merge its own Pull Requests - Apply manifests directly to the cluster - Override ArgoCD's sync policy - Access secrets or credentials stored in the cluster --- ## Part 3 — The PR-Based GitOps Workflow ### Why Autonomous AI Commits to Main Is Wrong Many AI infrastructure demos show the AI doing everything in one shot — it writes the manifests, applies them, and the cluster is updated. That is impressive for a demo. It is also a pattern you should never use in production. The problem is not that AI makes mistakes (it does). The problem is that there is no human checkpoint between "AI decided to do something" and "something happened in production." 
In any responsible engineering culture, infrastructure changes — especially anything touching live traffic routing — require a review step. The PR workflow solves exactly this.

### The Flow

```
AI creates feature branch
        ↓
AI commits all manifests there
        ↓
AI opens Pull Request
        ↓
Human reviews diff in Gitea UI   ← YOU ARE HERE
        ↓
Human merges PR
        ↓
ArgoCD detects change on main
        ↓
ArgoCD syncs cluster (30 seconds)
        ↓
AI verifies deployment health
```

### Who Does What

|Step|Who|
|---|---|
|Write all YAML manifests|AI|
|Create feature branch|AI|
|Commit files with conventional commit messages|AI|
|Open Pull Request with description|AI|
|Review the diff|**Human**|
|Approve and merge|**Human**|
|Detect new commit on main|ArgoCD|
|Apply resources to cluster|ArgoCD|
|Verify pods running, check logs|AI|

This is the right balance. The AI handles the tedious work — writing correct Kubernetes YAML, remembering namespace labels, getting cert-manager syntax right. The human handles the judgment call — is this change safe to apply right now?

![[Pasted image 20260304203539.png]]

---

## Part 4 — Building the Stable App

### The Request

The entire stable deployment was triggered by one plain English instruction:

> _"Deploy nginx with 3 replicas in namespace nginx-mcp, custom purple HTML page showing client IP, with ArgoCD application for GitOps sync."_

### What the AI Created

Five files were committed to a feature branch in Gitea under `apps/nginx-mcp/`:

**namespace.yaml** — Simple namespace definition with a `track: stable` label. The label matters — it makes it easy to filter resources by track when running kubectl commands later.

**configmap.yaml** — This is where the custom frontend lives.
A dark purple gradient page with: - "Hello from MCP" as the main heading - "MCP Powered" badge - "v1 stable" version indicator - Live client IP detection via the ipify API (`https://api.ipify.org`) - A custom `nginx.conf` embedded as a second key in the same ConfigMap The ipify integration is a nice touch for demos — it shows the actual visitor IP address on the page, making each request feel unique and personal. **deployment.yaml** — Three replicas of `nginx:1.25-alpine` with: - ConfigMap volume mounts for both `index.html` and `nginx.conf` - Resource limits: `cpu: 200m`, `memory: 128Mi` - Requests: `cpu: 50m`, `memory: 64Mi` - Readiness probe on `GET /` (delays traffic until nginx is serving) - Liveness probe on `GET /` (restarts container if it stops responding) **service.yaml** — ClusterIP on port 80. No external exposure here — that is handled by the Traefik IngressRoute in the weighted layer. **application.yaml** — The ArgoCD Application resource that wires everything together: ```yaml syncPolicy: automated: prune: true # delete resources removed from Git selfHeal: true # revert manual cluster changes syncOptions: - CreateNamespace=true ``` The `selfHeal: true` setting is critical for the drift detection demo — it means any manual `kubectl patch` or `kubectl edit` on cluster resources is automatically reverted within seconds. ### The Pull Request The AI opened a PR from `feat/nginx-mcp-stable` to `main`. In the Gitea UI at **https://git.thedevops.dev/admin/k3s-gitops**, the diff shows exactly five files, all new additions. You can see the full commit history, the author (`Claude AI`), the conventional commit message (`feat: add nginx-mcp stable app`), and the timestamp. After reviewing the manifests and confirming the configuration looks correct, the PR is merged. ArgoCD detects the new commit on `main` within its polling interval — or immediately if a force sync is triggered. 
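Pulling the pieces described above together, the stable deployment manifest plausibly looks something like this. This is a sketch based on the article's description, not the file from the repo — the label keys, probe timings, volume names, and the `nginx-mcp-config` ConfigMap name are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-mcp
  namespace: nginx-mcp
  labels:
    app: nginx-mcp
    track: stable
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-mcp
  template:
    metadata:
      labels:
        app: nginx-mcp
        track: stable
    spec:
      containers:
        - name: nginx
          image: nginx:1.25-alpine
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 128Mi
          readinessProbe:            # delays traffic until nginx is serving
            httpGet: { path: /, port: 80 }
            initialDelaySeconds: 2
            periodSeconds: 5
          livenessProbe:             # restarts the container if it stops responding
            httpGet: { path: /, port: 80 }
            initialDelaySeconds: 5
            periodSeconds: 10
          volumeMounts:
            - name: html
              mountPath: /usr/share/nginx/html/index.html
              subPath: index.html
            - name: html
              mountPath: /etc/nginx/nginx.conf
              subPath: nginx.conf
      volumes:
        - name: html
          configMap:
            name: nginx-mcp-config   # hypothetical name — one ConfigMap, two keys
```

The two `subPath` mounts are what let a single ConfigMap serve both the HTML page and the nginx configuration, as described above.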
### After Merge Thirty seconds after merge: 3/3 pods running across all three cluster nodes. ``` nginx-mcp-645797ddc8-4hll5 1/1 Running master1 nginx-mcp-645797ddc8-m5gdr 1/1 Running master2 nginx-mcp-645797ddc8-mbrr8 1/1 Running master3 ``` Each pod on a different node — Kubernetes spread them automatically due to the default topology spread constraints. --- ## Part 5 — DNS and SSL: Fully Automated ### The Request > _"Create DNS records: nginx.thedevops.dev, nginx-canary.thedevops.dev, nginx-stable.thedevops.dev — all pointing to the cluster IP, no proxy, TTL 1."_ Three Cloudflare DNS A records created in under 5 seconds via the Cloudflare MCP server. No browser, no dashboard, no copy-pasting IP addresses. ### The cert-manager Trap This is where most people hit a wall they don't expect. When using a standard Kubernetes `Ingress` object, cert-manager watches for the `cert-manager.io/cluster-issuer` annotation and automatically provisions a certificate. This is well documented and works reliably. **Traefik's `IngressRoute` CRD does not work this way.** `IngressRoute` is a custom resource defined by Traefik. cert-manager does not know how to read Traefik CRDs — it only understands standard `Ingress` objects. If you add cert-manager annotations to an `IngressRoute`, they are silently ignored. The certificate is never requested. The secret never appears. Traefik logs a TLS error. And nothing in the cert-manager logs tells you why, because from cert-manager's perspective, nothing happened. **The fix** is to create an explicit `Certificate` resource: ```yaml apiVersion: cert-manager.io/v1 kind: Certificate metadata: name: nginx-weighted-tls namespace: nginx-mcp spec: secretName: nginx-weighted-tls issuerRef: name: letsencrypt-http kind: ClusterIssuer dnsNames: - nginx.thedevops.dev ``` This tells cert-manager directly: "I want a certificate for this domain, store it in this secret." No annotation magic required. 
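The `Certificate` above references a `ClusterIssuer` named `letsencrypt-http`, which the article does not show. A minimal sketch of what such an issuer typically looks like — assuming HTTP-01 challenge solving through Traefik (inferred from the name); the email address and account-key secret name are placeholders:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-http
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com                # placeholder — use a real contact address
    privateKeySecretRef:
      name: letsencrypt-http-account-key  # where cert-manager stores the ACME account key
    solvers:
      - http01:
          ingress:
            ingressClassName: traefik
```

Because it is a `ClusterIssuer` rather than a namespaced `Issuer`, any `Certificate` in any namespace can reference it.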
cert-manager provisions the Let's Encrypt certificate, populates the secret, and Traefik picks it up automatically. ### HTTP to HTTPS Redirect A Traefik `Middleware` resource handles the redirect: ```yaml apiVersion: traefik.io/v1alpha1 kind: Middleware metadata: name: redirect-https namespace: nginx-mcp spec: redirectScheme: scheme: https permanent: true ``` This middleware is applied to the HTTP `IngressRoute` so that any request to `http://nginx.thedevops.dev` is permanently redirected to HTTPS before Traefik even attempts to route it. --- ## Part 6 — Building the Canary App ### Design Decisions The canary version runs in a completely separate namespace: `nginx-canary`. This is intentional. Namespace isolation means: - The canary deployment has its own resource quotas - RBAC policies can differ between stable and canary - A cascading failure in canary cannot affect stable namespace resources - ArgoCD manages them as independent applications with independent sync states ![[Pasted image 20260304204604.png]] ### Visual Difference by Design The canary page is intentionally very different from stable — dark orange/amber gradient instead of purple, with explicit visual indicators: - **"Canary Release"** badge in amber - **"v2.0.0"** version indicator - **Warning banner**: "⚠ This is a CANARY build. Not yet promoted to stable." - **Info grid** showing: track=canary, version=v2.0.0, image=nginx:1.25-alpine, replicas=2 - Live IP detection via ipify (same as stable) This visual difference is not just cosmetic — it makes it immediately obvious which version you are on when testing the traffic split. No guessing from page content. 
![[Pasted image 20260304204428.png]]

### Response Headers

The canary `nginx.conf` adds two custom response headers on every request:

```nginx
add_header X-Version "v2.0.0";
add_header X-Track "canary";
```

These headers are invisible to normal browser users but are extremely useful for:

- Automated testing scripts that verify which version handled a request
- Debugging which backend served a specific request
- Monitoring systems that track canary adoption rates

You can check them yourself — note the `-i` flag, because over HTTP/2 response header names arrive lowercased and a case-sensitive grep would miss them:

```bash
curl -sI https://nginx.thedevops.dev | grep -iE "x-version|x-track"
```

If you got the canary backend, you will see:

```
x-version: v2.0.0
x-track: canary
```

If you got stable — no headers. That asymmetry is itself useful information.

---

![[Pasted image 20260304205436.png]]

## Part 7 — Weighted Traffic Splitting: One Domain, Two Versions

### The Problem With Two Subdomains

A naive approach to canary deployment is to create two separate DNS entries: `nginx-stable.thedevops.dev` and `nginx-canary.thedevops.dev`. Give users one or the other.

This is **not canary deployment**. This is just two separate services. Real canary means a single entry point where the routing decision is made transparently, at the infrastructure layer, without the user's knowledge or involvement. The user always visits the same URL. The infrastructure decides which version they see.

### Traefik TraefikService — Weighted Routing

Traefik has a first-class resource for this: `TraefikService` with a `weighted` spec.

```yaml
apiVersion: traefik.io/v1alpha1
kind: TraefikService
metadata:
  name: nginx-weighted
  namespace: nginx-mcp
spec:
  weighted:
    services:
      - name: nginx-mcp
        namespace: nginx-mcp
        port: 80
        weight: 90   # ← stable gets 90% of traffic
      - name: nginx-canary-proxy
        namespace: nginx-mcp
        port: 80
        weight: 10   # ← canary gets 10% of traffic
```

Traefik reads this resource and distributes incoming requests proportionally. No session stickiness by default — each request is independently routed.
This means the same browser making multiple requests will sometimes hit stable and sometimes canary. ### The Cross-Namespace Problem Here is where Traefik v3 introduced a breaking change from v2 that catches almost everyone. The canary service lives in the `nginx-canary` namespace. The `TraefikService` lives in the `nginx-mcp` namespace. In Traefik v2, you could simply reference the service with its full namespace path. In Traefik v3, cross-namespace service references in `TraefikService` are **blocked by default** for security reasons. The error is not loud. Traefik logs something like: ``` the service "nginx-mcp-nginx-canary@kubernetescrd" does not exist ``` There is no mention of "cross-namespace" or "permission denied." You just see the service as not existing, even though it clearly does. The solution is a two-part fix. **Part 1 — Enable the required Traefik flags:** ``` --providers.kubernetescrd.allowExternalNameServices=true --providers.kubernetescrd.allowCrossNamespace=true ``` In K3s, these go into the `HelmChartConfig` resource for Traefik (more on this in Part 9). **Part 2 — ExternalName service as namespace bridge:** Even with the flags enabled, the cleanest approach is to create an `ExternalName` service in the _same_ namespace as the `TraefikService`: ```yaml apiVersion: v1 kind: Service metadata: name: nginx-canary-proxy namespace: nginx-mcp # same namespace as TraefikService spec: type: ExternalName externalName: nginx-canary.nginx-canary.svc.cluster.local ports: - port: 80 ``` This creates a local proxy within `nginx-mcp` namespace that forwards to the actual canary service across namespaces. The `TraefikService` only sees local services — no cross-namespace reference needed. 
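For completeness, the HTTPS `IngressRoute` that ties entrypoint, TLS secret, and weighted service together looks roughly like this — a sketch, with the resource name assumed. The crucial detail is `kind: TraefikService`, which tells Traefik to route to the weighted service rather than to a plain Kubernetes Service:

```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: nginx-weighted-https       # hypothetical name
  namespace: nginx-mcp
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`nginx.thedevops.dev`)
      kind: Rule
      services:
        - name: nginx-weighted     # the TraefikService holding the 90/10 weights
          kind: TraefikService
  tls:
    secretName: nginx-weighted-tls # populated by the explicit Certificate resource
```

A second `IngressRoute` on the `web` entrypoint carries the HTTP→HTTPS redirect middleware, so the weighted route itself only ever sees HTTPS traffic.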
### Final Routing Architecture ``` Browser → https://nginx.thedevops.dev ↓ Traefik IngressRoute (websecure) tls: nginx-weighted-tls secret ↓ TraefikService: nginx-weighted ┌──────────┴──────────┐ 90% 10% ↓ ↓ nginx-mcp nginx-canary-proxy Service (ExternalName) (ClusterIP) ↓ ↓ nginx-canary.nginx-canary 3 stable pods .svc.cluster.local (purple, v1) ↓ 2 canary pods (orange, v2) ``` All files for the weighted layer live under `apps/nginx-weighted/` in the Gitea repo: - `canary-proxy-svc.yaml` — the ExternalName bridge - `traefikservice.yaml` — the weights (the only file you need to change for traffic shifting) - `certificate.yaml` — explicit cert-manager certificate - `middleware.yaml` — HTTP→HTTPS redirect - `ingressroute.yaml` — two IngressRoutes (websecure + web redirect) - `application.yaml` — ArgoCD app pointing to this directory Browse the full source at: **https://git.thedevops.dev/admin/k3s-gitops/src/branch/main/apps** --- ## Part 8 — Try It Yourself: Testing the Live Demo The demo is live. Here is exactly how to see canary deployment in action. ### Live URLs | What | URL | What you will see | | ------------------------ | ------------------------------------------ | ------------------------------------- | | **Weighted entry point** | https://nginx.thedevops.dev | Purple or orange — depends on routing | | **Stable direct** | https://nginx-stable.thedevops.dev | Always purple, v1 stable | | **Canary direct** | https://nginx-canary.thedevops.dev | Always orange, v2 canary | | **Gitea repo** | https://git.thedevops.dev/admin/k3s-gitops | All manifests, full commit history | | **ArgoCD dashboard** | https://argocd.thedevops.dev | Live sync status of all applications | ### Method 1 — Browser Refresh Test This is the simplest way to experience canary routing as a real user would. 1. Open **https://nginx.thedevops.dev** in your browser 2. Refresh the page 10 to 20 times 3. Watch the page background color Most refreshes will show the **dark purple** stable page. 
Roughly 1 in 10 refreshes will show the **dark orange** canary page.

The URL in your address bar never changes. The switch between versions happens silently at the Traefik layer — you never see a redirect, never see a different domain.

This is exactly what real users experience during a production canary rollout. They have no idea they are on a different version. The version badge on the page and the "⚠ CANARY build" warning exist only because this is a demo — in production, users typically see no visible difference.

### Method 2 — curl Loop Test

For a more systematic test, run this from your terminal:

```bash
for i in $(seq 1 20); do
  curl -s https://nginx.thedevops.dev | grep -q "Canary" \
    && echo "🟠 canary" \
    || echo "🟣 stable"
done
```

Expected output at a 90/10 split — approximately 18 stable, 2 canary:

```
🟣 stable
🟣 stable
🟣 stable
🟣 stable
🟠 canary   ← you got the canary!
🟣 stable
🟣 stable
🟣 stable
🟣 stable
🟣 stable
🟠 canary   ← canary again
🟣 stable
🟣 stable
...
```

The distribution is probabilistic, not perfectly uniform — just like real traffic. You might get 3 canary hits in a row, or go 15 requests without one. On average over many requests, it converges to 90/10.

### Method 3 — Response Header Inspection

The cleanest way to verify which version handled your request, without relying on HTML content. The `-i` flag matters here — header names arrive lowercased over HTTP/2:

```bash
curl -sI https://nginx.thedevops.dev | grep -iE "x-version|x-track"
```

**Canary response:**

```
x-version: v2.0.0
x-track: canary
```

**Stable response:**

```
(no headers — v1 does not add custom headers)
```

Run this in a loop to see the traffic split in action:

```bash
for i in $(seq 1 30); do
  result=$(curl -sI https://nginx.thedevops.dev 2>/dev/null | grep -i "x-version" | head -1)
  if [ -n "$result" ]; then
    echo "🟠 CANARY — $result"
  else
    echo "🟣 stable — no version header"
  fi
done
```

### Method 4 — Watch a Live Weight Change

This is the most impressive demonstration if you have access to both the browser and the Gitea repo simultaneously.

1.
Open **https://nginx.thedevops.dev** in one browser tab — keep refreshing 2. Open **https://git.thedevops.dev/admin/k3s-gitops** in another tab 3. Open **https://argocd.thedevops.dev** in a third tab 4. Edit `apps/nginx-weighted/traefikservice.yaml` in Gitea — change `weight: 90` to `weight: 50` and `weight: 10` to `weight: 50` 5. Commit the change 6. Watch ArgoCD detect and sync the change (takes ~30 seconds) 7. Go back to the first tab — now roughly every other refresh hits canary The entire weight shift happens with no pod restarts, no downtime, no configuration reloads. Traefik picks up the new `TraefikService` weights dynamically. ### Understanding the Traffic Split Scenarios |Scenario|stable weight|canary weight|When to use| |---|---|---|---| |Initial canary test|90|10|First exposure — limit blast radius| |Extended testing|50|50|Canary looks healthy, testing at scale| |Full promote|0|100|Canary is the new stable| |Emergency rollback|100|0|Canary is causing issues — instant fix| ### Checking ArgoCD Live Open **https://argocd.thedevops.dev** to see the three ArgoCD applications: - **nginx-mcp** — the stable app in `nginx-mcp` namespace - **nginx-canary** — the canary app in `nginx-canary` namespace - **nginx-weighted** — the traffic routing layer (TraefikService, IngressRoute, Certificate) All three should show **Synced** and **Healthy** in green. Click on `nginx-weighted` to see the full resource tree: TraefikService, two IngressRoutes, Certificate, Middleware, and the ExternalName service. Click any resource to see the live YAML currently deployed in the cluster — the same content that is in Gitea, because ArgoCD enforces they are identical. ### Canary vs Blue/Green — The Key Difference While you are testing, it is worth understanding why this is canary and not blue/green: **Blue/Green:** 100% of traffic switches instantly from old version to new. You flip a switch — all users get the new version simultaneously. Rollback = flip the switch back. 
**Canary:** Traffic _bleeds_ gradually from old to new. 10% of users see the new version first. If metrics look good, increase to 30%, then 50%, then 100%. Each step is a Git commit. If something goes wrong at 10%, only 1 in 10 users was affected — and rollback is a single number change. The weight number in `traefikservice.yaml` is that dial. You control the blast radius of your deployment by controlling that number. --- ## Part 9 — The Hard Parts Nobody Documents ### 1. Traefik v3 Cross-Namespace Restriction **What breaks:** TraefikService cannot reference Kubernetes Services in other namespaces by default in Traefik v3. **How it fails:** Silently. No obvious error. Traefik logs show the service "does not exist" — misleading because it does exist, just in a different namespace. **How to find it:** Check Traefik pod logs: ```bash kubectl logs -n kube-system deployment/traefik | grep -i "does not exist" ``` **The fix — two required flags:** ``` --providers.kubernetescrd.allowExternalNameServices=true --providers.kubernetescrd.allowCrossNamespace=true ``` **The fix — ExternalName bridge service:** Even with flags enabled, using an ExternalName service as a namespace proxy is cleaner and more explicit about intent. **Why it matters:** Every Traefik v2 tutorial and most v3 tutorials do not mention this. If you are migrating from v2 or following older guides, you will hit this and spend hours wondering why a perfectly valid service cannot be found. ### 2. cert-manager + IngressRoute Silent Failure **What breaks:** cert-manager annotations on Traefik `IngressRoute` resources are completely ignored. **How it fails:** Silently. No error in cert-manager logs. No certificate request created. The TLS secret never appears. Traefik logs a TLS error referencing a missing secret. **The diagnosis pattern:** 1. Check if the secret exists: `kubectl get secret nginx-weighted-tls -n nginx-mcp` 2. 
If not: check cert-manager logs: `kubectl logs -n cert-manager deployment/cert-manager` 3. If cert-manager logs show nothing: the `Certificate` resource was never created **The fix:** Always create an explicit `Certificate` resource when using Traefik `IngressRoute`. Never rely on annotations. **Why it matters:** This is one of the most common questions in the Traefik community forums. The answer is always the same — IngressRoute does not trigger cert-manager. But it is nowhere in the official documentation as a prominent warning. ### 3. ArgoCD Force Sync — Why Polling Kills Demos ArgoCD polls the Git repository every 3 minutes by default. For live demos or rapid iteration, waiting 3 minutes for a change to deploy is unacceptable. **The fix** — force an immediate sync with a kubectl patch: ```bash kubectl patch application nginx-weighted -n argocd \ --type=merge \ -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{"revision":"HEAD","prune":true}}}' ``` This is what the AI runs after every Git commit during the demo. The sync completes in under 10 seconds. ### 4. HelmChartConfig — Making Traefik Flags Survive Upgrades K3s manages Traefik via a Helm chart. If you add flags by directly patching the Traefik deployment, those patches will be **overwritten** the next time K3s upgrades Traefik via Helm. The correct way to persist Traefik configuration in K3s: ```yaml apiVersion: helm.cattle.io/v1 kind: HelmChartConfig metadata: name: traefik namespace: kube-system spec: valuesContent: |- additionalArguments: - "--providers.kubernetescrd.allowExternalNameServices=true" - "--providers.kubernetescrd.allowCrossNamespace=true" ``` Apply this once and it survives all future K3s upgrades. Without it, you will find your canary routing mysteriously broken after the next K3s update. ### 5. The Configmap Name Mismatch Incident During a rebuild session, an old Gitea file from a previous deployment had `nginx-canary-config` as the ConfigMap name. 
The new deployment manifest expected `nginx-canary-html`. ArgoCD synced both files without complaint — both are valid Kubernetes resources. But the pods could not start. **The event in `kubectl describe pod`:** ``` Warning FailedMount MountVolume.SetUp failed for volume "html": configmap "nginx-canary-html" not found ``` **How the AI diagnosed it:** 1. Noticed pods stuck in `ContainerCreating` for longer than expected 2. Ran `kubectl describe pod nginx-canary-xxx -n nginx-canary` 3. Read the `Events` section — identified `configmap "nginx-canary-html" not found` 4. Checked the Gitea file — found `name: nginx-canary-config` in the ConfigMap metadata 5. Updated the Gitea file to use `nginx-canary-html` 6. Committed the fix, force-synced ArgoCD 7. Pods recovered — total time: under 2 minutes This is the incident response loop in miniature. Read events, identify root cause, fix in Git, sync, verify. No manual `kubectl edit`, no emergency `kubectl apply`. The fix went through Git — it is auditable, it is revertable, and it proves the process works under pressure. --- ## Part 10 — The GitOps Loop in Action ### Shifting Traffic With One Number The entire traffic management story comes down to two numbers in one file: `apps/nginx-weighted/traefikservice.yaml`. ```yaml spec: weighted: services: - name: nginx-mcp weight: 90 # ← change this - name: nginx-canary-proxy weight: 10 # ← and this ``` Change the numbers, commit to Gitea, ArgoCD applies within 30 seconds. No pod restarts. No configuration reload. No downtime. Traefik reads the updated `TraefikService` resource and adjusts its routing weights in real time. ### Drift Detection `selfHeal: true` in the ArgoCD application means the cluster state must always match Git. If someone runs `kubectl patch deployment nginx-mcp -n nginx-mcp --patch '{"spec":{"replicas":1}}'` directly on the cluster — bypassing GitOps entirely — ArgoCD detects the drift and reverts it to 3 replicas within approximately 2 seconds. 
This is not just a safety feature. For AI-managed infrastructure, it is essential. It means Git is not just the _intended_ source of truth — it is the _enforced_ source of truth. The AI cannot accidentally leave the cluster in an inconsistent state, because ArgoCD continuously corrects any deviation. ### The Broken Deployment Scenario During the experiment, a broken image tag was pushed to Git to simulate a real incident: ```yaml # Commit: "chore: simulate broken deployment for demo" image: nginx:99.99-broken-image ``` ArgoCD synced the change. The new pod attempted to pull `nginx:99.99-broken-image` from Docker Hub. The image did not exist. The pod entered `ImagePullBackOff`. **What did NOT happen:** The stable pods did not go down. Kubernetes rolling updates work by starting new pods before terminating old ones. Since the new pod never became Ready, the old pods were never terminated. All 3 replicas continued serving traffic throughout the incident. **The AI incident response:** - T+0s: Bad image committed to Git - T+30s: ArgoCD synced the broken manifest - T+45s: New pod detected in `ImagePullBackOff` - T+60s: AI ran `kubectl describe pod` on the failing pod - T+75s: Root cause identified: `ErrImagePull: manifest for nginx:99.99-broken-image not found` - T+80s: AI updated `deployment.yaml` in Gitea back to `nginx:1.25-alpine` - T+90s: Commit pushed, ArgoCD force-synced - T+120s: Broken pod terminated, 3/3 healthy pods serving traffic Zero downtime. Full audit trail in Git. No manual intervention. --- ## Part 11 — Honest Reflection ### What Worked Really Well **DNS operations** were flawless. Creating three Cloudflare A records in a single conversation, with correct TTL and proxy settings, took under 10 seconds. Faster than doing it manually in the Cloudflare dashboard. **AI diagnosis from kubectl describe** was genuinely impressive. 
Reading raw pod event output and correctly identifying a ConfigMap name mismatch — without being prompted to look there — shows the AI is doing more than pattern matching. **The PR workflow** worked exactly as designed. Feature branch, files committed, PR opened with a clear title and description, human merges, ArgoCD deploys. This is the correct production GitOps pattern and it worked end-to-end. **Speed.** The entire deployment — DNS, stable app, canary app, weighted routing layer, TLS certificate — from scratch to verified traffic split took under 10 minutes in a clean session. ### What Was Awkward or Failed **Leftover state from previous sessions** caused conflicts. If an earlier deployment left files in Gitea but the ArgoCD apps and namespaces were deleted, the next session would fail trying to create files that already exist. The AI had to detect this, check what existed, and work around it — which it handled, but it added friction. **Tool loading overhead.** MCP tools need to be loaded at the start of each conversation. The AI occasionally called a tool before it was loaded, got an error, called `tool_search` to load it, then retried. This added latency in the first few minutes of each session. **AI occasionally retried before diagnosing.** When the canary pods were stuck in `ContainerCreating`, the AI polled the pod status several times before running `kubectl describe` to find the actual error. Better behavior would be to describe immediately when a pod is not Running after a reasonable wait. **ArgoCD cascade deletion timing.** When deleting ArgoCD applications, the associated namespace resources (pods, services, configmaps) are cleaned up asynchronously. In a rapid rebuild scenario, trying to create the same namespace immediately after deletion sometimes required waiting. ### Is This Production-Ready? The **pattern** is production-ready. 
Weighted canary routing on a single domain, GitOps-enforced cluster state, PR-based change control, automatic TLS — all of these are exactly how production systems should work.

The **AI-managed setup as-is** is not yet ready for unattended production use. What is missing:

- **RBAC for MCP access** — currently the AI has cluster-admin-equivalent permissions. In production, MCP server credentials should be scoped to the minimum necessary operations.
- **Audit logging of AI actions** — there is no separate audit trail distinguishing what the AI did from what a human did. Git history partially covers this, but not operations that never become commits.
- **Approval gates for destructive operations** — the AI can delete namespaces and DNS records. These should require explicit human confirmation, not just a PR merge.

---

## Part 12 — What's Next

### Prometheus Metrics → Automatic Weight Rollback

The next logical evolution: connect Prometheus to the decision loop. If the canary error rate rises above a threshold while the canary weight is above zero, the AI automatically shifts the weights back to 100/0 without waiting for human intervention. This closes the feedback loop from "deploy" to "monitor" to "rollback" entirely within the AI + GitOps system.

### Slack Integration During Incidents

When the AI detects a broken deployment — ImagePullBackOff, CrashLoopBackOff, a failed readiness probe — it should post to Slack immediately. Not a generic alert, but a specific message: what broke, what the root cause is, and what fix is being applied. The same incident narrative the AI produces in the chat interface, delivered to the team channel in real time.

### Full PR-Only Workflow

In the current setup, some operations (force sync, direct ArgoCD status checks) bypass Git. In a fully strict PR-only workflow, even weight changes would require a PR. The AI creates the branch, makes the change, opens the PR — and nothing happens until a human merges.
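Under this strict flow, a weight change is a minimal diff: the PR touches only the two `weight` values in the TraefikService shown in Appendix B. As an illustrative sketch, stepping from the live 90/10 split to the 50/50 "extended testing" split would look like this:

```yaml
apiVersion: traefik.io/v1alpha1
kind: TraefikService
metadata:
  name: nginx-weighted
  namespace: nginx-mcp
spec:
  weighted:
    services:
      - name: nginx-mcp            # stable: 3 pods, purple UI, v1
        namespace: nginx-mcp
        port: 80
        weight: 50                 # was 90
      - name: nginx-canary-proxy   # ExternalName bridge into the canary namespace
        namespace: nginx-mcp
        port: 80
        weight: 50                 # was 10
```

Traefik picks up the new weights without restarting any pods — but under the strict workflow, nothing reaches the cluster until the PR merges and ArgoCD syncs the commit.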
This is appropriate for production where even a 10% traffic shift to an unvalidated canary is a business decision, not a technical one. ### Multi-Cluster Promotion The same pattern scales to multi-cluster promotion: dev → staging → prod. The AI manages weight gates at each stage. Promote to 100% canary in dev, open PR for staging deployment, human reviews and merges, weight testing begins in staging. Graduated promotion with human checkpoints at each environment boundary. ### The Bigger Picture MCP is becoming the standard interface between AI and the tools that run the world. Today's experiment shows one cluster, four MCP servers, one AI. The trajectory is clear: as more systems expose MCP interfaces — ticketing systems, monitoring platforms, cloud providers, security scanners — the AI becomes an orchestrator that can reason across all of them simultaneously. The engineer's role shifts. Less time writing YAML. More time reviewing what the AI proposed and deciding whether it is the right change. That is not a lesser role — it is arguably a more valuable one. --- ## Conclusion This experiment set out to answer one question: can AI manage real Kubernetes infrastructure conversationally, with proper GitOps discipline, and produce results you would be comfortable showing to a production engineering team? The answer is yes — with caveats. The Git history at **https://git.thedevops.dev/admin/k3s-gitops** shows every commit the AI made. The messages are clean. The YAML is correct. The PR workflow enforced human oversight at every step. The ArgoCD applications at **https://argocd.thedevops.dev** stayed Synced and Healthy. The traffic split worked exactly as configured, verified by running 30 requests and counting responses. What makes this interesting is not that AI can write Kubernetes YAML — it has been able to do that for years. 
What is interesting is that AI can now: - **Act on that YAML** through MCP servers, not just generate it - **Read real cluster state** and reason about what it means - **Diagnose real failures** from kubectl events and fix them through Git - **Coordinate across multiple systems** simultaneously — DNS, Git, cluster, sync engine The GitOps loop turned out to be a natural fit for AI operations. Git provides the audit trail. ArgoCD provides the enforcement. The PR workflow provides the human checkpoint. The AI provides the speed and the diagnosis. Each component does what it does best. **Test it yourself:** https://nginx.thedevops.dev Refresh the page several times. One of those refreshes will hit the canary. You will know it when you see orange. --- ## Appendix — Key Code Snippets ### A. ExternalName Proxy Service ```yaml apiVersion: v1 kind: Service metadata: name: nginx-canary-proxy namespace: nginx-mcp spec: type: ExternalName externalName: nginx-canary.nginx-canary.svc.cluster.local ports: - port: 80 ``` ### B. TraefikService Weighted Config ```yaml apiVersion: traefik.io/v1alpha1 kind: TraefikService metadata: name: nginx-weighted namespace: nginx-mcp spec: weighted: services: - name: nginx-mcp namespace: nginx-mcp port: 80 weight: 90 - name: nginx-canary-proxy namespace: nginx-mcp port: 80 weight: 10 ``` ### C. IngressRoute With TraefikService Reference ```yaml apiVersion: traefik.io/v1alpha1 kind: IngressRoute metadata: name: nginx-weighted namespace: nginx-mcp spec: entryPoints: - websecure routes: - match: Host(`nginx.thedevops.dev`) kind: Rule services: - name: nginx-weighted namespace: nginx-mcp kind: TraefikService # ← this is what references the weighted router tls: secretName: nginx-weighted-tls ``` ### D. 
Explicit Certificate Resource for cert-manager ```yaml apiVersion: cert-manager.io/v1 kind: Certificate metadata: name: nginx-weighted-tls namespace: nginx-mcp spec: secretName: nginx-weighted-tls issuerRef: name: letsencrypt-http kind: ClusterIssuer dnsNames: - nginx.thedevops.dev ``` ### E. ArgoCD Force-Sync kubectl Patch Command ```bash kubectl patch application nginx-weighted -n argocd \ --type=merge \ -p '{"operation":{"initiatedBy":{"username":"admin"},"sync":{"revision":"HEAD","prune":true}}}' ``` ### F. HelmChartConfig for Persistent Traefik Flags (K3s) ```yaml apiVersion: helm.cattle.io/v1 kind: HelmChartConfig metadata: name: traefik namespace: kube-system spec: valuesContent: |- additionalArguments: - "--providers.kubernetescrd.allowExternalNameServices=true" - "--providers.kubernetescrd.allowCrossNamespace=true" ``` --- _All source files are available at: https://git.thedevops.dev/admin/k3s-gitops_ _Live demo: https://nginx.thedevops.dev_ ## Completion Report ### ✅ Deployment Complete #### DNS Records Created |Hostname|IP|Proxied| |---|---|---| |nginx.thedevops.dev|62.169.30.194|false| |nginx-canary.thedevops.dev|62.169.30.194|false| |nginx-stable.thedevops.dev|62.169.30.194|false| --- #### Gitea Pull Requests Merged |PR|Branch|Title|Files| |---|---|---|---| |#2|feat/nginx-mcp-stable|deploy nginx-mcp stable app (v1, 3 replicas)|5| |#3|feat/nginx-canary|deploy nginx-canary app (v2.0.0, 2 replicas)|5| |#4|feat/nginx-weighted|deploy nginx-weighted traffic layer (90/10, TLS)|6| #### Gitea Files Created |App|File|Purpose| |---|---|---| |nginx-mcp|namespace.yaml|Namespace, label track=stable| |nginx-mcp|configmap.yaml|Dark purple HTML, v1 stable badge, IP via ipify| |nginx-mcp|deployment.yaml|3 replicas, nginx:1.25-alpine, probes, resource limits| |nginx-mcp|service.yaml|ClusterIP port 80| |nginx-mcp|application.yaml|ArgoCD app, automated sync, prune, selfHeal| |nginx-canary|namespace.yaml|Namespace, label track=canary| |nginx-canary|configmap.yaml|Dark 
orange HTML, v2 badge, ⚠ warning, info grid, X-Version/X-Track headers| |nginx-canary|deployment.yaml|2 replicas, nginx:1.25-alpine, probes, resource limits| |nginx-canary|service.yaml|ClusterIP port 80| |nginx-canary|application.yaml|ArgoCD app, automated sync, prune, selfHeal| |nginx-weighted|canary-proxy-svc.yaml|ExternalName bridge nginx-mcp → nginx-canary namespace| |nginx-weighted|traefikservice.yaml|Weighted router 90% stable / 10% canary| |nginx-weighted|certificate.yaml|cert-manager TLS for nginx.thedevops.dev| |nginx-weighted|middleware.yaml|HTTP → HTTPS permanent redirect| |nginx-weighted|ingressroute.yaml|IngressRoute websecure (TraefikService) + web (redirect)| |nginx-weighted|application.yaml|ArgoCD app, destination nginx-mcp namespace| --- #### Kubernetes Resources |Namespace|Resource|Name|Status| |---|---|---|---| |nginx-mcp|Namespace|nginx-mcp|✅ Active| |nginx-mcp|Deployment|nginx-mcp|✅ 3/3 Ready| |nginx-mcp|Service|nginx-mcp|✅ ClusterIP| |nginx-mcp|Service|nginx-canary-proxy|✅ ExternalName| |nginx-mcp|TraefikService|nginx-weighted|✅ Active (90/10)| |nginx-mcp|IngressRoute|nginx-weighted|✅ websecure| |nginx-mcp|IngressRoute|nginx-weighted-http|✅ web → https| |nginx-mcp|Middleware|redirect-https|✅ Active| |nginx-mcp|Certificate|nginx-weighted-tls|✅ Ready| |nginx-mcp|Secret|nginx-weighted-tls|✅ kubernetes.io/tls| |nginx-canary|Namespace|nginx-canary|✅ Active| |nginx-canary|Deployment|nginx-canary|✅ 2/2 Ready| |nginx-canary|Service|nginx-canary|✅ ClusterIP| --- #### ArgoCD Applications |App|Sync Status|Health|Revision| |---|---|---|---| |nginx-mcp|Synced|Healthy|a09690fe| |nginx-canary|Synced|Healthy|a09690fe| |nginx-weighted|Synced|Healthy|a09690fe| --- #### Traffic Split Verification 30 requests to `https://nginx.thedevops.dev` from inside the cluster: |Version|Hits|Percentage| |---|---|---| |stable (purple, v1)|26|87%| |canary (orange, v2)|4|13%| _Probabilistic routing — converges to 90/10 over larger sample sizes._ --- #### How to change traffic 
weights File: `apps/nginx-weighted/traefikservice.yaml` → edit on a branch → open PR → merge → ArgoCD applies in ~30s |Scenario|stable weight|canary weight| |---|---|---| |Initial canary test|90|10| |Extended testing|50|50| |Full promote to canary|0|100| |Emergency rollback|100|0| --- #### Architecture Summary Three Pull Requests were reviewed and merged by a human before anything reached the cluster. After each merge, ArgoCD detected the new commit on `main` and synced the cluster automatically. Traffic arrives at `https://nginx.thedevops.dev` and is terminated by Traefik with a Let's Encrypt TLS certificate managed by cert-manager. The `IngressRoute` hands the request to the `TraefikService` named `nginx-weighted`, which distributes traffic by weight — 90% to the stable `nginx-mcp` deployment (3 pods, purple UI, v1) and 10% to `nginx-canary-proxy`, an ExternalName service that bridges into the `nginx-canary` namespace (2 pods, orange UI, v2). To shift traffic, edit the two weight numbers in `traefikservice.yaml`, open a PR, merge — no pod restarts, no downtime. **Live demo:** [https://nginx.thedevops.dev](https://nginx.thedevops.dev) — refresh several times to catch the canary 🟠