> A production-grade handbook for DevOps Engineers, SREs, Platform Engineers, and Kubernetes Engineers ## 1. Introduction Network issues cause the majority of production outages. Whether it's a microservice that can't reach its database, a Kubernetes pod stuck in `CrashLoopBackOff` due to a DNS failure, or a load balancer silently dropping traffic — understanding how to navigate these problems quickly is what separates a junior engineer from a senior one. What makes network troubleshooting uniquely challenging is the number of layers involved. A single failed HTTP request can originate from a misconfigured security group in AWS, a broken CoreDNS pod in Kubernetes, a firewall rule added by a teammate, a wrong route in an OS routing table, or a misconfigured application binding to the wrong interface. Without a systematic approach, you're guessing. **Common real-world scenarios you'll encounter:** - **Service unreachable** — `curl` returns `Connection refused` or times out - **Kubernetes pod communication failure** — pods can't reach each other or services - **DNS failures** — names don't resolve, or resolve to wrong IPs - **Load balancer misconfiguration** — traffic reaches the LB but never hits backend pods - **Firewall blocking traffic** — connectivity works from some hosts but not others **The layered troubleshooting mindset** is everything. Always start at the bottom (physical/IP connectivity) and work your way up (application). Jumping straight to application logs when the real problem is a firewall rule costs hours. ``` Application layer → Is the app listening? Is it returning errors? Transport layer → Is the port open? Are connections being established? Network layer → Can packets route between hosts? Data link/physical → Is there connectivity at all? ``` --- ## 2. How Networking Works: A DevOps Perspective ### OSI vs TCP/IP — What Actually Matters Forget memorizing 7 OSI layers for an exam. 
In production, you work with 4 practical layers: |Layer|Protocol|Your Tools| |---|---|---| |Application|HTTP, gRPC, DNS, TLS|curl, dig, openssl| |Transport|TCP, UDP|ss, netstat, nc| |Network|IP, ICMP|ping, traceroute, ip route| |Link/Physical|Ethernet, ARP|ip link, arp, ethtool| ### What Happens When You Run `curl https://example.com` Understanding this flow tells you exactly where to look when things break. ``` ┌─────────────────────────────────────────────────────────────────┐ │ curl https://example.com │ │ │ │ 1. DNS Resolution │ │ └─ Check /etc/hosts → /etc/nsswitch.conf → resolv.conf │ │ → Query DNS server (e.g., 8.8.8.8) │ │ → Returns: 93.184.216.34 │ │ │ │ 2. Routing Decision │ │ └─ Check routing table: which interface to use? │ │ → Selects eth0, gateway 192.168.1.1 │ │ │ │ 3. TCP Handshake (port 443) │ │ └─ SYN → server │ │ SYN-ACK ← server │ │ ACK → server │ │ │ │ 4. TLS Handshake │ │ └─ ClientHello → server │ │ ServerHello + Certificate ← server │ │ Key exchange, session established │ │ │ │ 5. HTTP Request │ │ └─ GET / HTTP/1.1 │ │ Host: example.com │ │ │ │ 6. Response │ │ └─ HTTP/1.1 200 OK │ └─────────────────────────────────────────────────────────────────┘ ``` **Where failures occur at each step:** - **DNS** → `could not resolve host` — check `/etc/resolv.conf`, test with `dig` - **Routing** → `Network unreachable` — check `ip route`, gateway, VPC routing tables - **TCP handshake** → `Connection refused` (port closed) or timeout (firewall dropping) - **TLS** → `SSL certificate verify failed`, expired cert, wrong hostname - **HTTP** → 4xx/5xx errors, application-level problems --- ## 3. Systematic Troubleshooting Methodology Gut-feel debugging is slow and inconsistent. A structured approach cuts your mean time to resolution (MTTR) dramatically. ### The Six-Step Framework **Step 1: Define the problem precisely** Before touching a single tool, answer: - What is the exact error message? - What is the source (client IP/pod/service)? 
- What is the destination (IP/hostname/port/protocol)? - When did it start? What changed? - Is it affecting all requests or some? ```bash # Gather basic context hostname && ip addr show date && uptime last reboot ``` **Step 2: Verify basic connectivity (Layer 3)** ```bash # Can you reach the host at all? ping -c 4 <target-ip> # What's the path? traceroute <target-ip> ``` If `ping` fails to an IP: routing or firewall issue. If `ping` succeeds but everything else fails: higher-layer problem. **Step 3: Verify DNS resolution** ```bash dig <hostname> dig @8.8.8.8 <hostname> # Bypass local resolver nslookup <hostname> ``` If the hostname doesn't resolve, or resolves to the wrong IP, you've found your problem. **Step 4: Check routing** ```bash ip route get <target-ip> # Shows which route will be used ip route show # Full routing table ``` **Step 5: Check firewall and port accessibility** ```bash nc -zv <target-ip> <port> # Is the port reachable? telnet <target-ip> <port> ss -tulnp # Is the service listening locally? iptables -L -n -v # Any rules blocking traffic? ``` **Step 6: Check the application layer** ```bash curl -v http://<target>:<port>/health curl -vvv --resolve <host>:<port>:<ip> https://<host>/path journalctl -u <service> --since "10 min ago" kubectl logs <pod> --previous ``` ### Decision Tree ``` Is ping to target IP working? ├── NO → Routing/firewall issue │ ├── Check: ip route get <ip> │ ├── Check: iptables -L │ └── Check: Cloud security groups └── YES → Is DNS resolving correctly? ├── NO → DNS issue │ ├── Check: /etc/resolv.conf │ ├── Check: dig @<nameserver> <host> │ └── K8s: check CoreDNS pods └── YES → Is the port open? ├── NO → Service not running or firewall blocking │ ├── Check: ss -tulnp │ └── Check: nc -zv host port └── YES → Application layer issue ├── Check: curl -v ├── Check: app logs └── Check: TLS certificate ``` --- ## 4. 
Essential Linux Network Troubleshooting Tools

### ping — Baseline Connectivity

**What it does:** Sends ICMP echo requests to verify Layer 3 reachability and measure round-trip time.

**When to use:** First check in any troubleshooting session.

```bash
ping -c 4 8.8.8.8        # 4 packets to Google DNS
ping -c 4 google.com     # Tests both DNS and connectivity
ping -I eth0 10.0.0.5    # Force specific interface
ping -s 1400 10.0.0.5    # Test with larger packet (MTU issues)
```

**Interpreting output:**

```
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=118 time=12.3 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=118 time=11.8 ms

--- 8.8.8.8 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 11.8/12.0/12.3/0.2 ms
```

- `ttl` varying between replies = the path is changing between packets (ECMP, route flapping)
- High `mdev` (variance) = network instability
- 100% packet loss to an IP that should be reachable = firewall dropping ICMP or host down

---

### traceroute / tracepath — Path Analysis

**What it does:** Shows every network hop between you and the destination. Identifies where packets stop.

```bash
traceroute 8.8.8.8
traceroute -T -p 443 api.example.com   # TCP-based traceroute (bypasses ICMP blocks)
tracepath 8.8.8.8                      # Similar, no root required
mtr --report 8.8.8.8                   # Real-time, combines ping+traceroute
```

**Production use:** If traceroute shows packets reaching hop 7 but never hop 8, the problem is between those two nodes — which might be a cloud router, firewall, or misconfigured VPN gateway.

---

### ip — Interface and Route Management

**What it does:** The modern replacement for `ifconfig` and `route`. Manages interfaces, addresses, routes, and more.

```bash
ip addr show             # All interfaces and IPs
ip addr show eth0        # Specific interface
ip route show            # Routing table
ip route get 10.0.1.5    # Which route would be used?
ip link show # Interface state (UP/DOWN) ip neigh show # ARP table ``` **Real-world example:** ```bash $ ip route get 10.96.0.1 10.96.0.1 via 192.168.1.1 dev eth0 src 192.168.1.50 uid 0 cache ``` This tells you: traffic to `10.96.0.1` (Kubernetes ClusterIP) goes via gateway `192.168.1.1` on `eth0`. If you expect it to go through a Kubernetes CNI interface instead, something is misconfigured. --- ### ss — Socket Statistics **What it does:** Shows open sockets, listening ports, and established connections. Faster and more powerful than `netstat`. ```bash ss -tulnp # TCP+UDP, listening only, with process names ss -tnp state established # All established TCP connections ss -s # Summary statistics ss -tulnp | grep :443 # Who is listening on 443? ss -tnp dst 10.0.0.5 # Connections to specific destination ``` **Interpreting output:** ``` Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process tcp LISTEN 0 128 0.0.0.0:80 0.0.0.0:* users:(("nginx",pid=1234,fd=6)) tcp ESTAB 0 0 10.0.0.10:54312 10.0.0.5:443 users:(("curl",pid=5678,fd=5)) ``` - `Recv-Q` nonzero on LISTEN: app is not accepting connections fast enough (backlog full) - `Send-Q` nonzero on ESTAB: network congestion, destination not consuming data --- ### curl — HTTP/HTTPS Testing **What it does:** The Swiss Army knife for testing HTTP endpoints. 
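When a request is slow rather than failing outright, the first question is *which phase* was slow. curl's `-w` variables give you the raw numbers, and a small wrapper can name the dominant phase for you. A minimal sketch (the helper name and the 50% threshold are mine, not curl's): feed it the four cumulative values produced by `-w "%{time_namelookup} %{time_connect} %{time_appconnect} %{time_total}"`.

```bash
# classify_timing: given curl's cumulative timings (seconds), name the
# dominant phase. curl's -w times are cumulative, so each phase is the
# difference between adjacent values.
classify_timing() {
  local dns=$1 connect=$2 tls=$3 total=$4
  awk -v d="$dns" -v c="$connect" -v t="$tls" -v tot="$total" 'BEGIN {
    if (d > tot * 0.5)          print "slow DNS resolution"
    else if (c - d > tot * 0.5) print "slow TCP connect (latency or firewall retries)"
    else if (t - c > tot * 0.5) print "slow TLS handshake"
    else                        print "time mostly in server processing/transfer"
  }'
}

# Example with numbers captured from a slow request (DNS dominates here):
classify_timing 0.90 0.95 1.00 1.20
```

In practice you would wire it up as `classify_timing $(curl -o /dev/null -s -w "%{time_namelookup} %{time_connect} %{time_appconnect} %{time_total}" https://example.com)`.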
```bash curl -v http://service:8080/health # Verbose output curl -I https://example.com # Headers only curl -w "\nTime: %{time_total}s\n" https://example.com # Timing curl --connect-timeout 5 --max-time 10 http://slow-service/ curl -k https://self-signed.example.com # Skip TLS verification curl --resolve api.example.com:443:10.0.0.5 https://api.example.com # Override DNS curl -H "Host: myapp.example.com" http://10.0.0.5/ # Test ingress with custom Host header ``` **The timing breakdown is gold for diagnosing slow requests:** ```bash curl -w "DNS: %{time_namelookup}s | Connect: %{time_connect}s | TLS: %{time_appconnect}s | Total: %{time_total}s\n" \ -o /dev/null -s https://example.com ``` --- ### dig — DNS Interrogation **What it does:** Queries DNS servers directly. The primary tool for DNS troubleshooting. ```bash dig google.com # A record dig google.com AAAA # IPv6 record dig google.com MX # Mail records dig @8.8.8.8 google.com # Query specific nameserver dig +short google.com # IP only dig +trace google.com # Full resolution chain dig -x 8.8.8.8 # Reverse DNS dig @10.96.0.10 kubernetes.default.svc.cluster.local # K8s CoreDNS ``` **Reading dig output:** ``` ;; ANSWER SECTION: google.com. 299 IN A 142.250.80.46 ;; Query time: 12 msec ;; SERVER: 8.8.8.8#53 ``` - `299` = TTL in seconds (low TTL = DNS changes propagate quickly) - `SERVER` = which resolver actually answered - No ANSWER section = DNS record doesn't exist or resolution failed --- ### tcpdump — Packet Capture **What it does:** Captures and analyzes raw network packets. The ultimate source of truth. 
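One habit worth adopting before the command reference: bound every capture. An unattended tcpdump on a busy production interface can quietly fill a disk. A sketch using coreutils `timeout` (the wrapper name is mine; tcpdump's own `-c` packet count and `-G`/`-W` rotation flags achieve similar limits):

```bash
# bounded: run any capture command with a hard time limit.
# timeout(1) exits with 124 when it has to kill the command at the deadline.
bounded() {
  local seconds=$1; shift
  timeout "$seconds" "$@"
  echo "capture window closed (exit $?)"
}

# Real usage (interface and filter are placeholders):
#   bounded 30 tcpdump -i eth0 -nn -w /tmp/debug.pcap port 8080
# Demonstrated with a stand-in long-running command:
bounded 1 sleep 5
```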
```bash tcpdump -i eth0 # All traffic on eth0 tcpdump -i any port 443 # All TLS traffic, any interface tcpdump host 10.0.0.5 # Traffic to/from specific host tcpdump src 10.0.0.5 and dst port 8080 # Filtered tcpdump -w /tmp/capture.pcap -i eth0 # Save to file (analyze in Wireshark) tcpdump -i eth0 -nn -v port 53 # DNS queries, no hostname resolution tcpdump 'tcp[tcpflags] & (tcp-syn) != 0' # SYN packets only ``` **Reading a TCP handshake:** ``` 14:23:01 IP 10.0.0.10.54312 > 10.0.0.5.443: Flags [S], seq 1234567890 14:23:01 IP 10.0.0.5.443 > 10.0.0.10.54312: Flags [S.], seq 9876543210, ack 1234567891 14:23:01 IP 10.0.0.10.54312 > 10.0.0.5.443: Flags [.], ack 9876543211 ``` - `[S]` = SYN (initiating connection) - `[S.]` = SYN-ACK (server acknowledging) - `[.]` = ACK (connection established) - `[R]` = RST (connection refused/reset — port closed or firewall rejecting) - `[F]` = FIN (graceful close) If you see SYN packets leaving but no SYN-ACK arriving, a firewall is dropping packets. --- ### nc (netcat) — Swiss Army Knife for TCP/UDP **What it does:** Opens raw TCP/UDP connections, useful for port testing and basic service simulation. ```bash nc -zv 10.0.0.5 443 # Test if port is open (verbose) nc -zv 10.0.0.5 8080-8090 # Scan port range nc -l 8080 # Listen on port 8080 (simple server) echo "GET / HTTP/1.0" | nc 10.0.0.5 80 # Raw HTTP request nc -u 10.0.0.5 514 # UDP test (syslog port) ``` --- ### nmap — Network Scanner **What it does:** Scans ports and services across one or many hosts. ```bash nmap -p 443,8080,8443 10.0.0.5 # Scan specific ports nmap -p- 10.0.0.5 # All 65535 ports nmap -sV 10.0.0.5 # Version detection nmap -sn 10.0.0.0/24 # Host discovery (no port scan) nmap --script ssl-cert 10.0.0.5 -p 443 # Check TLS certificate ``` **Note:** Use with permission. In production environments, coordinate scans to avoid triggering security alerts. --- ## 5. 
Troubleshooting DNS Problems DNS issues cause a disproportionate number of production incidents, and they're often subtle — everything looks fine until a TTL expires or a pod restarts. ### How DNS Resolution Works on Linux ``` Application → glibc resolver → Check /etc/nsswitch.conf (order: files, dns) → Check /etc/hosts (files) → Query /etc/resolv.conf nameserver(s) └── systemd-resolved (127.0.0.53) on modern systems └── Upstream DNS (8.8.8.8, or corporate DNS) ``` ```bash cat /etc/resolv.conf # nameserver 127.0.0.53 <- systemd-resolved stub # nameserver 10.96.0.10 <- K8s CoreDNS (inside pods) # search default.svc.cluster.local svc.cluster.local cluster.local # options ndots:5 ``` ### Common DNS Issues and How to Debug Them **Issue: DNS timeout** ```bash # How long does resolution take? time dig google.com # Is the resolver responding? dig @127.0.0.53 google.com dig @8.8.8.8 google.com # Bypass local resolver # Check systemd-resolved status systemd-resolve --status resolvectl status ``` **Issue: Wrong DNS server configured** ```bash # See what resolver is being used cat /etc/resolv.conf # On systemd-resolved systems resolvectl dns # Override for a single query dig @10.0.0.1 internal.service.corp ``` **Issue: DNS works for public names but not internal** ```bash # Check search domains cat /etc/resolv.conf | grep search # Manually test internal name with full FQDN dig internal.service.corp. dig internal.service.corp # Note trailing dot matters ``` ### Kubernetes CoreDNS Inside Kubernetes pods, DNS is handled by CoreDNS at the `kube-dns` ClusterIP (typically `10.96.0.10`). 
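The `search` and `ndots` lines in that pod resolv.conf decide how many queries a single lookup fans out into. The rule (standard glibc/musl resolver behavior): a name with fewer dots than `ndots` and no trailing dot is tried against each search domain before the literal name. A pure-bash simulation of the expansion (the function is illustrative, not a real resolver):

```bash
# expand_queries NAME NDOTS SEARCH_DOMAIN... : print the names the resolver
# will try, in order, for a lookup.
expand_queries() {
  local name=$1 ndots=$2; shift 2
  local only_dots=${name//[^.]/}   # keep just the dots so we can count them
  if (( ${#only_dots} < ndots )) && [[ $name != *. ]]; then
    local d
    for d in "$@"; do echo "$name.$d"; done
  fi
  echo "$name"   # the literal name is attempted last
}

# A pod looking up "myservice" with the search list shown above:
expand_queries myservice 5 default.svc.cluster.local svc.cluster.local cluster.local
```

Four candidate names (doubled again for A and AAAA) for one short lookup is why `ndots:5` features in so many latency investigations, and why a trailing dot (`api.example.com.`) on external endpoints skips the expansion entirely.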
```bash # Verify CoreDNS pods are running kubectl -n kube-system get pods -l k8s-app=kube-dns # Test DNS from inside a pod kubectl run -it --rm debug --image=busybox --restart=Never -- sh # Inside pod: nslookup kubernetes.default nslookup myservice.mynamespace.svc.cluster.local cat /etc/resolv.conf # Check CoreDNS logs kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50 # Check CoreDNS ConfigMap kubectl -n kube-system get configmap coredns -o yaml ``` **The `ndots:5` setting explained:** In Kubernetes, short names like `myservice` trigger up to 5 search domain attempts before falling back to the root. This means `myservice` expands to `myservice.default.svc.cluster.local`, then `myservice.svc.cluster.local`, etc. This can cause DNS timeouts when hitting external names — consider using FQDNs with a trailing dot for external endpoints. --- ## 6. Troubleshooting Connectivity Issues ### Localhost Issues ```bash # Is the service bound to the right interface? ss -tulnp | grep <port> # Service bound to 127.0.0.1 won't be reachable externally # Service bound to 0.0.0.0 listens on all interfaces ``` If a service is bound to `127.0.0.1:8080` and you're trying to reach it from another host — that's your problem. Check the application configuration to bind to `0.0.0.0` or the specific external IP. ### Server-to-Server Connectivity ```bash # From source server, test destination ping <destination-ip> nc -zv <destination-ip> <port> curl -v http://<destination-ip>:<port>/health # Check routing ip route get <destination-ip> # Example output $ ip route get 10.0.1.50 10.0.1.50 via 10.0.0.1 dev eth0 src 10.0.0.10 uid 0 ``` **Reading the routing table:** ```bash $ ip route show default via 192.168.1.1 dev eth0 proto dhcp 10.0.0.0/8 via 10.10.0.1 dev vpn0 # Internal traffic via VPN 172.16.0.0/12 via 10.10.0.1 dev vpn0 192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.50 ``` Traffic to `10.0.1.50` matches the `/8` route and goes via the VPN. 
If that VPN tunnel is down, connection fails even though the host is physically reachable. ### Container Networking Issues In Docker/Kubernetes, containers have their own network namespace with separate interfaces and routes. ```bash # Docker: inspect container network docker inspect <container> | grep -i network docker exec <container> ip addr docker exec <container> ip route docker exec <container> cat /etc/resolv.conf # Check Docker bridge network ip link show docker0 bridge link show ``` --- ## 7. Troubleshooting Ports and Services ### Is the Service Listening? ```bash ss -tulnp # All listening sockets with process ss -tulnp | grep :8080 # Specific port ss -tulnp | grep nginx # Specific process # If ss not available (old systems) netstat -tulnp netstat -tulnp | grep LISTEN ``` ### Is the Port Reachable Remotely? ```bash nc -zv 10.0.0.5 8080 # Quick port test nc -zv -w 3 10.0.0.5 8080 # 3 second timeout # Test from within Kubernetes pod kubectl exec -it <pod> -- nc -zv <service-name> <port> kubectl exec -it <pod> -- wget -qO- http://<service>:<port>/health ``` **Interpreting nc results:** ``` Connection to 10.0.0.5 8080 port [tcp/http-alt] succeeded! # Port open nc: connectx to 10.0.0.5 port 8080 (tcp) failed: Connection refused # Port closed # (hangs/timeout) = firewall dropping packets silently ``` The difference between "Connection refused" and a timeout is critical: - **Refused** = host is reachable, but nothing is listening on that port (or iptables REJECT) - **Timeout** = packets are being dropped (firewall DROP rule, routing issue, host unreachable) --- ## 8. 
Firewall Troubleshooting ### iptables ```bash # View all rules with packet counts iptables -L -n -v # View NAT table (important for K8s/Docker) iptables -t nat -L -n -v # View filter table explicitly iptables -t filter -L INPUT -n -v --line-numbers # Check if a specific port is blocked iptables -L INPUT -n | grep DROP iptables -L INPUT -n | grep REJECT ``` **Understanding chains:** - **INPUT** — traffic destined for this host - **OUTPUT** — traffic originating from this host - **FORWARD** — traffic passing through this host (relevant for routers, K8s nodes) - **PREROUTING** (nat table) — DNAT happens here (e.g., K8s service VIP → pod IP) - **POSTROUTING** (nat table) — SNAT/masquerade happens here **Real-world K8s iptables example:** When you access a Kubernetes ClusterIP service, iptables intercepts the traffic and rewrites the destination to a backend pod IP using DNAT rules created by kube-proxy: ```bash # See K8s service rules iptables -t nat -L KUBE-SERVICES -n -v | grep 10.96.0.10 iptables -t nat -L KUBE-SVC-<hash> -n -v # Trace a packet through iptables (kernel module) modprobe xt_LOG iptables -t raw -I PREROUTING -p tcp --dport 8080 -j LOG --log-prefix "PKT: " # Watch: dmesg | grep PKT # Clean up: iptables -t raw -D PREROUTING -p tcp --dport 8080 -j LOG --log-prefix "PKT: " ``` ### nftables nftables replaces iptables on modern distributions (RHEL 8+, Debian 10+): ```bash nft list ruleset # All rules nft list table inet filter # Filter table nft list chain inet filter input # Input chain ``` ### ufw (Ubuntu Firewall) ```bash ufw status verbose # Current rules and status ufw allow 8080/tcp # Allow port ufw deny from 10.0.0.5 # Block source IP ufw logging on # Enable logging (/var/log/ufw.log) ``` --- ## 9. Packet-Level Troubleshooting with tcpdump tcpdump is your ground truth. When you can't trust what the application says, packets don't lie. 
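One way to make packets answer questions directly: dump a capture to text (`tcpdump -l -nn ... | tee capture.txt`) and scan it for half-open handshakes. A sketch in awk, run here against inline sample lines in place of a live capture (field positions assume tcpdump's default `src > dst: Flags [..]` line format):

```bash
# Sample lines standing in for real tcpdump output:
cat > /tmp/sample-capture.txt <<'EOF'
14:23:01 IP 10.0.0.10.54312 > 10.0.0.5.443: Flags [S], seq 1234567890
14:23:02 IP 10.0.0.10.54312 > 10.0.0.5.443: Flags [S], seq 1234567890
14:23:01 IP 10.0.0.10.54313 > 10.0.0.5.80: Flags [S], seq 42
14:23:01 IP 10.0.0.5.80 > 10.0.0.10.54313: Flags [S.], seq 9, ack 43
EOF

# find_half_open FILE: report client->server flows that sent SYNs but never
# saw a SYN-ACK back.
find_half_open() {
  awk '
    { sub(/:$/, "", $5) }                       # strip trailing colon from dst
    $7 == "[S],"  { syn[$3 " > " $5]++ }        # SYN:     client > server
    $7 == "[S.]," { synack[$5 " > " $3] = 1 }   # SYN-ACK: keyed client > server
    END {
      for (flow in syn)
        if (!(flow in synack))
          printf "no SYN-ACK for %s (%d SYNs sent)\n", flow, syn[flow]
    }
  ' "$1"
}

find_half_open /tmp/sample-capture.txt
```

Repeated SYNs with no SYN-ACK coming back is the signature of a silent DROP somewhere between the two hosts.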
### Confirming Traffic Reaches the Server ```bash # On the server, capture incoming connections tcpdump -i any -nn port 8080 # On the server, filter for specific client tcpdump -i any -nn src 10.0.0.10 and port 8080 # Capture and save for later analysis tcpdump -i eth0 -w /tmp/debug.pcap port 443 # Transfer to your laptop and open in Wireshark ``` ### Diagnosing Connection Failures **Scenario: Client sends SYN, gets no response** ```bash # On client tcpdump -i eth0 host 10.0.0.5 and port 8080 # See: SYN packets going out, nothing coming back # Conclusion: Firewall is dropping packets (DROP rule, security group) # On server tcpdump -i eth0 port 8080 # If SYN packets don't appear here: firewall before server # If SYN packets appear but no SYN-ACK: server-side issue (app not listening, server firewall) ``` **Scenario: Connection established but no data** ```bash tcpdump -i any -nn -A host 10.0.0.5 and port 8080 # -A prints ASCII content # Look for HTTP request/response or lack thereof ``` **Advanced tcpdump filters:** ```bash # Only SYN packets (new connections) tcpdump 'tcp[tcpflags] & tcp-syn != 0' # Only RST packets (connection resets) tcpdump 'tcp[tcpflags] & tcp-rst != 0' # HTTP GET requests tcpdump -A -s 0 'tcp dst port 80 and tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420' # Large packets (MTU debugging) tcpdump 'ip[2:2] > 1400' # DNS queries tcpdump -i any -nn port 53 ``` --- ## 10. Kubernetes Network Troubleshooting Kubernetes networking has multiple layers: pod networking, service networking, and ingress. Each can fail independently. ### Pod Networking Fundamentals Every pod gets its own IP (managed by CNI: Calico, Flannel, Cilium, etc.). Pods can communicate directly via IP across nodes — if CNI is working correctly. 
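A recurring question when pod traffic goes missing is whether a destination address is even supposed to be on the pod network, i.e. inside the cluster's pod CIDR. A pure-bash IPv4 membership check (a sketch; `ipcalc` or your CNI's own tooling does the same job):

```bash
# ip_to_int: convert a dotted-quad address to a 32-bit integer.
ip_to_int() { local IFS=.; set -- $1; echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 )); }

# in_cidr IP CIDR: succeed if IP falls inside CIDR.
in_cidr() {
  local ip=$1 net=${2%/*} bits=${2#*/}
  local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( ($(ip_to_int "$ip") & mask) == ($(ip_to_int "$net") & mask) ))
}

# Example: 10.0.0.0/16 stands in for your cluster's pod CIDR.
in_cidr 10.0.1.45 10.0.0.0/16    && echo "inside pod CIDR" || echo "outside pod CIDR"
in_cidr 192.168.1.50 10.0.0.0/16 && echo "inside pod CIDR" || echo "outside pod CIDR"
```

If an address you expected to be a pod IP falls outside the pod CIDR, you are debugging the wrong layer: look at Services, NAT, or an external dependency instead.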
```bash # Get pod IPs kubectl get pods -o wide -n <namespace> # Check which node a pod is on kubectl get pod <pod> -o wide # Test connectivity from inside a pod kubectl exec -it <pod> -- ping <other-pod-ip> kubectl exec -it <pod> -- nc -zv <service-name> <port> kubectl exec -it <pod> -- wget -qO- http://<service>:<port>/health ``` ### Service Networking Kubernetes Services create virtual IPs (ClusterIP) and route traffic to matching pods via iptables/IPVS rules set up by kube-proxy. ```bash # Inspect a service kubectl get svc <service-name> -o wide kubectl describe svc <service-name> # Verify endpoints exist (if Endpoints is empty, no pods match the selector) kubectl get endpoints <service-name> kubectl describe endpoints <service-name> ``` **Empty Endpoints is the #1 cause of "Service unreachable" in Kubernetes.** This means no pods match the service selector. Check: ```bash # What selector does the service use? kubectl get svc <service> -o jsonpath='{.spec.selector}' # Do any pods match? 
kubectl get pods -l app=myapp # Replace with your selector labels kubectl get pods --show-labels | grep <label> ``` ### Debugging with Ephemeral Containers ```bash # Run a debug pod in same namespace kubectl run debug-pod --image=nicolaka/netshoot -it --rm --restart=Never -- bash # Inside netshoot: dig, curl, tcpdump, iperf all available # Attach to running pod's network namespace (K8s 1.23+) kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container> ``` ### Ingress Troubleshooting ```bash # Check ingress configuration kubectl get ingress -A kubectl describe ingress <name> # Check ingress controller pods kubectl -n ingress-nginx get pods kubectl -n ingress-nginx logs <ingress-pod> --tail=50 # Test with explicit Host header curl -H "Host: myapp.example.com" http://<ingress-controller-ip>/path # Test TLS openssl s_client -connect myapp.example.com:443 -servername myapp.example.com ``` ### CNI Issues If pods on different nodes can't communicate: ```bash # Check CNI pods (Calico example) kubectl -n kube-system get pods -l k8s-app=calico-node # Check node-level routes ip route show | grep <pod-cidr> # Verify CNI interface exists ip link show cali* # Calico ip link show flannel* # Flannel ip link show cilium* # Cilium # Check for CNI config ls /etc/cni/net.d/ cat /etc/cni/net.d/10-calico.conflist ``` --- ## 11. Cloud Network Troubleshooting ### AWS **Security Groups** are the most common source of connectivity problems in AWS. Unlike iptables, they are stateful — allowing inbound automatically allows return traffic. 
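When you are staring at a long rule list, it is easy to misread a port range. A throwaway helper that answers "would this port be admitted?" for ranges you paste in (pure bash; the rule set below is illustrative, and in practice you would feed it the `FromPort`/`ToPort` pairs from `aws ec2 describe-security-groups` output):

```bash
# port_allowed PORT RULE... : each rule is "fromPort-toPort", mirroring a
# security group's FromPort/ToPort pair.
port_allowed() {
  local port=$1; shift
  local range
  for range in "$@"; do
    local from=${range%-*} to=${range#*-}
    (( port >= from && port <= to )) && { echo "port $port matches rule $range"; return 0; }
  done
  echo "port $port matches no rule"
  return 1
}

# Illustrative rule set: SSH, HTTPS, and an app range.
port_allowed 8085 22-22 443-443 8080-8090
port_allowed 5432 22-22 443-443 8080-8090
```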
```bash # From the AWS CLI aws ec2 describe-security-groups --group-ids sg-xxxxxxxx aws ec2 describe-network-acls --filters Name=vpc-id,Values=vpc-xxxxxxxx # Check effective security groups on an instance aws ec2 describe-instances --instance-ids i-xxxxxxxx \ --query 'Reservations[].Instances[].SecurityGroups' ``` **Common AWS network gotchas:** - Security Group allows port 8080, but the **application is binding to 127.0.0.1** — packets arrive but are rejected by OS - **NACLs are stateless** — you need both inbound AND outbound rules (unlike Security Groups) - **VPC Peering** is not transitive — A peers with B, B peers with C ≠ A can reach C - **Route tables** — subnets need explicit routes to reach peered VPCs, VPN gateways, etc. ```bash # Test from EC2 instance curl http://169.254.169.254/latest/meta-data/ # Instance metadata (verify IMDSv2) curl http://169.254.169.254/latest/meta-data/local-ipv4 # VPC Flow Logs — enable on suspect subnets, then query CloudWatch Logs # Look for REJECT action on expected traffic ``` ### GCP ```bash # Check firewall rules gcloud compute firewall-rules list gcloud compute firewall-rules describe <rule-name> # Check routes gcloud compute routes list # VPC network details gcloud compute networks describe <network-name> ``` **GCP-specific gotchas:** Firewall rules apply to the entire VPC network, not subnets. Target tags or service accounts control which VMs the rule applies to. ### Azure In Azure, **Network Security Groups (NSGs)** can be attached at both the subnet level and the NIC level — both are evaluated. A common mistake is configuring the NIC NSG but forgetting the subnet NSG, or vice versa. ```bash az network nsg show -g <resource-group> -n <nsg-name> az network nsg rule list -g <resource-group> --nsg-name <nsg-name> az network nic show -g <resource-group> -n <nic-name> ``` --- ## 12. 
Real Production Incident Walkthrough ### Scenario: "Payment Service Unreachable After Deployment" **Alert received:** `payment-service` health check failing. 0% success rate for 5 minutes. **Step 1: Define the problem** ```bash kubectl get pods -n payments # NAME READY STATUS RESTARTS AGE # payment-svc-7d9f8b6-xk2pq 0/1 Running 0 3m # payment-svc-7d9f8b6-mn8qt 0/1 Running 0 3m ``` Pods are running but not READY. Something is failing the readiness probe. **Step 2: Check events and logs** ```bash kubectl describe pod payment-svc-7d9f8b6-xk2pq -n payments # Events: # Warning Unhealthy 2m kubelet Readiness probe failed: Get "http://10.0.1.45:8080/health": dial tcp 10.0.1.45:8080: connect: connection refused kubectl logs payment-svc-7d9f8b6-xk2pq -n payments --tail=30 # Error: Cannot connect to database: dial tcp 10.96.45.12:5432: i/o timeout ``` App is running but can't reach its database. **Step 3: DNS and service check** ```bash kubectl exec -it payment-svc-7d9f8b6-xk2pq -n payments -- sh # Inside pod: nslookup postgres-service.databases.svc.cluster.local # Server: 10.96.0.10 # Non-authoritative answer: Name: postgres-service.databases.svc.cluster.local # Address: 10.96.45.12 ``` DNS resolves correctly. **Step 4: Test connectivity** ```bash # Still inside pod nc -zv 10.96.45.12 5432 # (hangs — timeout, not refused) ``` Port times out. Either the service has no endpoints, or a NetworkPolicy is blocking it. **Step 5: Check endpoints** ```bash kubectl get endpoints postgres-service -n databases # NAME ENDPOINTS AGE # postgres-service <none> 45m ``` **No endpoints!** The service has no backing pods. **Step 6: Find the root cause** ```bash kubectl get pods -n databases # NAME READY STATUS RESTARTS AGE # postgres-0 0/1 ImagePullBackOff 0 46m ``` The database pod failed to start due to `ImagePullBackOff`. During the deployment, someone updated the database image tag in the Helm values and pushed an image that doesn't exist in the registry. 
**Resolution:** ```bash # Fix the image tag helm upgrade postgres ./charts/postgres -n databases --set image.tag=15.3 # Verify pod comes up kubectl get pods -n databases -w # Verify endpoints populate kubectl get endpoints postgres-service -n databases # NAME ENDPOINTS AGE # postgres-service 10.0.1.82:5432 2m # Verify payment service recovers kubectl get pods -n payments ``` **Total resolution time: 11 minutes.** The structured approach — checking events, logs, DNS, connectivity, endpoints in sequence — avoided hours of guessing. --- ## 13. Advanced Troubleshooting Techniques ### conntrack — Connection Tracking The Linux connection tracking table records all NAT'd connections. Useful for debugging K8s service routing and SNAT issues. ```bash conntrack -L # List all tracked connections conntrack -L | grep 10.0.0.5 # Filter by IP conntrack -L | wc -l # Total tracked connections # If this is near nf_conntrack_max, you'll drop connections cat /proc/sys/net/netfilter/nf_conntrack_count cat /proc/sys/net/netfilter/nf_conntrack_max # Watch new connections in real-time conntrack -E -e NEW ``` **High conntrack count is a real production issue.** Under heavy load, it can exhaust the conntrack table, causing new connections to silently fail with no error. ### Network Namespaces Containers and pods have isolated network namespaces. 
To troubleshoot at the packet level inside a container without installing tools in the container: ```bash # Find the container PID docker inspect <container> | grep Pid # Or for K8s crictl inspect <container-id> | grep pid # Enter the network namespace nsenter -t <pid> -n -- ip addr show nsenter -t <pid> -n -- ss -tulnp nsenter -t <pid> -n -- tcpdump -i any port 8080 ``` ### Advanced ss Filters ```bash # Show only connections in TIME_WAIT (can indicate connection storm) ss -tn state time-wait | wc -l # Show sockets by memory usage (find memory hog) ss -tm # Connections to a specific destination port ss -tn dst :443 # Filter by source address ss -tn src 10.0.0.5 ``` ### strace for Socket Debugging When you need to know exactly what syscalls an application makes: ```bash strace -e trace=network -p <pid> strace -e connect,bind,sendto,recvfrom curl http://example.com ``` This shows every `connect()` call, which IP:port the app is trying to reach, and what errors it receives — invaluable when the app logs are ambiguous. --- ## 14. Automation and Monitoring The best network troubleshooting is the one you don't have to do because your monitoring caught the issue first. 
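Even before full Prometheus coverage is in place, a cron'd shell check can catch the silent failure modes from the previous section, conntrack exhaustion being the classic example. A sketch (the function name and 80% threshold are mine; the `/proc` paths are the standard ones):

```bash
# conntrack_check COUNT MAX [THRESHOLD%]: warn when the conntrack table
# crosses a utilization threshold.
conntrack_check() {
  local count=$1 max=$2 threshold=${3:-80}
  local pct=$(( count * 100 / max ))
  if (( pct >= threshold )); then
    echo "WARN: conntrack table ${pct}% full (${count}/${max})"
  else
    echo "OK: conntrack table ${pct}% full"
  fi
}

# Live usage reads the real counters:
#   conntrack_check "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                   "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
# Stubbed values for illustration:
conntrack_check 250000 262144
```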
### Key Metrics to Monitor with Prometheus ```yaml # Key network metrics to alert on: # Blackbox exporter — probe availability probe_success{job="blackbox", instance="https://api.example.com"} == 0 # Node exporter — interface errors rate(node_network_receive_errs_total[5m]) > 0 rate(node_network_transmit_errs_total[5m]) > 0 # DNS resolution failures (CoreDNS) rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m]) > 0.01 # Kubernetes endpoint availability kube_endpoint_address_available{endpoint="my-service"} == 0 # TCP retransmits (sign of network congestion) rate(node_netstat_Tcp_RetransSegs[5m]) > 10 ``` ### Grafana Dashboards Key dashboards to maintain: - **Node Exporter Full** (dashboard ID 1860) — network interface metrics per node - **Kubernetes Networking** — pod/service network traffic - **CoreDNS** — DNS query rates, SERVFAIL rates, response times - **Blackbox Exporter** — endpoint availability and probe duration ### Proactive Alerting ```yaml # AlertManager rule example groups: - name: network rules: - alert: ServiceEndpointDown expr: kube_endpoint_address_available == 0 for: 1m labels: severity: critical annotations: summary: "Kubernetes service {{ $labels.endpoint }} has no available endpoints" - alert: DNSHighLatency expr: histogram_quantile(0.99, rate(coredns_dns_request_duration_seconds_bucket[5m])) > 0.5 for: 5m labels: severity: warning annotations: summary: "CoreDNS p99 latency > 500ms" ``` ### Continuous Connectivity Testing Run synthetic monitoring probes from within your cluster: ```bash # Deploy a simple network probe pod that tests connectivity continuously kubectl apply -f - <<EOF apiVersion: apps/v1 kind: Deployment metadata: name: network-probe spec: replicas: 1 selector: matchLabels: app: network-probe template: metadata: labels: app: network-probe spec: containers: - name: probe image: nicolaka/netshoot command: ["/bin/sh", "-c"] args: - while true; do nc -zv postgres-service.databases 5432 && echo "DB OK" || echo "DB FAIL"; 
          sleep 10; done
EOF

kubectl logs -f deployment/network-probe
```

---

## 15. Best Practices Checklist

**Investigation practices:**

- Always start with `ping` before anything else — establish whether basic connectivity exists
- Always check DNS separately from connectivity — they fail independently
- Always run diagnostic commands from both ends (source and destination) when possible
- Save tcpdump captures (`-w file.pcap`) before the issue clears itself
- Document your debugging steps — you'll face this issue again
- Check "what changed" in your deployment pipeline before spending time on tools

**Infrastructure practices:**

- Implement health checks and readiness probes on all Kubernetes workloads
- Always set resource limits — a pod consuming all CPU can cause DNS timeouts that look like network issues
- Use NetworkPolicies in Kubernetes, but roll them out carefully — vanilla NetworkPolicy has no audit mode, though some CNIs (Calico, Cilium) offer log/audit actions for testing rules before enforcing them
- Keep firewall rules documented and in version control (Terraform, Pulumi)
- Enable VPC Flow Logs in cloud environments — they're invaluable after the fact
- Set up Blackbox Exporter probes for all critical service endpoints
- Monitor CoreDNS health metrics actively

**Security practices:**

- Default-deny NetworkPolicies in Kubernetes namespaces, then explicitly allow
- Put security group/NACL changes through change management — mistakes there are silent and cause immediate outages
- Regularly audit firewall rules for stale entries

**Operational practices:**

- Maintain a network diagram — knowing expected topology cuts debug time in half
- Keep `netshoot` or similar debug images available in your container registry
- Create runbooks for known failure patterns (DNS failures, endpoint empty, etc.)
- Add network-layer metrics to your SLOs — don't just track application error rates

---

## 16. Conclusion

Network troubleshooting is a skill that compounds over time.
The engineer who's debugged a hundred incidents builds a mental model that shortcuts the diagnostic process — they know where to look first because they've seen the patterns. **The core mental model:** Connectivity is a chain. Every link in that chain (DNS, routing, firewall, application) must work for the end result to work. Your job is to find the broken link, and the fastest way to do that is to test each link systematically rather than randomly. **Key principles to internalize:** - Packets don't lie. When in doubt, `tcpdump` at both ends. - DNS is nearly always involved. Test it early, test it explicitly. - "Connection timeout" and "Connection refused" mean different things — read the error carefully. - Empty Kubernetes endpoints cause more service outages than any other single issue. - The most recent change is usually the cause. Check your deployment history before spending 30 minutes with tools. **Master these tools first:** `ping`, `dig`, `ss`, `curl -v`, `nc`, `tcpdump`. With these six, you can resolve 90% of production network issues. The rest — `conntrack`, `nsenter`, `strace` — are for the 10% of deep-dive investigations. Network troubleshooting is not magic. It's methodology, layered knowledge, and the right tools applied in the right order. Build that foundation, and production outages become problems to solve rather than fires to fight. --- _Vladimiras Levinas is a Lead DevOps Engineer with 18+ years in fintech infrastructure. He runs a production K3s homelab and writes about AI infrastructure at doc.thedevops.dev_