> A production-grade handbook for DevOps Engineers, SREs, Platform Engineers, and Kubernetes Engineers
## 1. Introduction
Network issues cause the majority of production outages. Whether it's a microservice that can't reach its database, a Kubernetes pod stuck in `CrashLoopBackOff` due to a DNS failure, or a load balancer silently dropping traffic — understanding how to navigate these problems quickly is what separates a junior engineer from a senior one.
What makes network troubleshooting uniquely challenging is the number of layers involved. A single failed HTTP request can originate from a misconfigured security group in AWS, a broken CoreDNS pod in Kubernetes, a firewall rule added by a teammate, a wrong route in an OS routing table, or a misconfigured application binding to the wrong interface. Without a systematic approach, you're guessing.
**Common real-world scenarios you'll encounter:**
- **Service unreachable** — `curl` returns `Connection refused` or times out
- **Kubernetes pod communication failure** — pods can't reach each other or services
- **DNS failures** — names don't resolve, or resolve to wrong IPs
- **Load balancer misconfiguration** — traffic reaches the LB but never hits backend pods
- **Firewall blocking traffic** — connectivity works from some hosts but not others
**The layered troubleshooting mindset** is everything. Always start at the bottom (physical/IP connectivity) and work your way up (application). Jumping straight to application logs when the real problem is a firewall rule costs hours.
```
Application layer → Is the app listening? Is it returning errors?
Transport layer → Is the port open? Are connections being established?
Network layer → Can packets route between hosts?
Data link/physical → Is there connectivity at all?
```
---
## 2. How Networking Works: A DevOps Perspective
### OSI vs TCP/IP — What Actually Matters
Forget memorizing 7 OSI layers for an exam. In production, you work with 4 practical layers:
|Layer|Protocol|Your Tools|
|---|---|---|
|Application|HTTP, gRPC, DNS, TLS|curl, dig, openssl|
|Transport|TCP, UDP|ss, netstat, nc|
|Network|IP, ICMP|ping, traceroute, ip route|
|Link/Physical|Ethernet, ARP|ip link, arp, ethtool|
### What Happens When You Run `curl https://example.com`
Understanding this flow tells you exactly where to look when things break.
```
┌─────────────────────────────────────────────────────────────────┐
│ curl https://example.com │
│ │
│ 1. DNS Resolution │
│ └─ Check /etc/hosts → /etc/nsswitch.conf → resolv.conf │
│ → Query DNS server (e.g., 8.8.8.8) │
│ → Returns: 93.184.216.34 │
│ │
│ 2. Routing Decision │
│ └─ Check routing table: which interface to use? │
│ → Selects eth0, gateway 192.168.1.1 │
│ │
│ 3. TCP Handshake (port 443) │
│ └─ SYN → server │
│ SYN-ACK ← server │
│ ACK → server │
│ │
│ 4. TLS Handshake │
│ └─ ClientHello → server │
│ ServerHello + Certificate ← server │
│ Key exchange, session established │
│ │
│ 5. HTTP Request │
│ └─ GET / HTTP/1.1 │
│ Host: example.com │
│ │
│ 6. Response │
│ └─ HTTP/1.1 200 OK │
└─────────────────────────────────────────────────────────────────┘
```
**Where failures occur at each step:**
- **DNS** → `could not resolve host` — check `/etc/resolv.conf`, test with `dig`
- **Routing** → `Network unreachable` — check `ip route`, gateway, VPC routing tables
- **TCP handshake** → `Connection refused` (port closed) or timeout (firewall dropping)
- **TLS** → `SSL certificate verify failed`, expired cert, wrong hostname
- **HTTP** → 4xx/5xx errors, application-level problems
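These failure stages map onto curl's documented exit codes, so a small wrapper can name the broken step without reading verbose output. A sketch under that assumption (the function name is made up; the exit-code meanings come from curl's manual):

```shell
#!/usr/bin/env bash
# Map a curl exit code to the layer that failed.
# Codes per `man curl`: 6=DNS, 7=connect refused, 28=timeout,
# 35=TLS handshake, 60=certificate verification, 0=success.
diagnose_curl_exit() {
  case "$1" in
    0)  echo "OK: transport fine, now check the HTTP status code" ;;
    6)  echo "DNS: could not resolve host (check resolv.conf, try dig)" ;;
    7)  echo "TCP: connection refused (port closed or REJECT rule)" ;;
    28) echo "Timeout: packets likely dropped (firewall DROP, routing)" ;;
    35) echo "TLS: handshake failed (protocol or cipher mismatch)" ;;
    60) echo "TLS: certificate verification failed (expired, wrong host)" ;;
    *)  echo "Other curl error $1: see the EXIT CODES section of man curl" ;;
  esac
}

# Usage:
#   curl -sS -o /dev/null --connect-timeout 5 https://example.com
#   diagnose_curl_exit $?
```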
---
## 3. Systematic Troubleshooting Methodology
Gut-feel debugging is slow and inconsistent. A structured approach cuts your mean time to resolution (MTTR) dramatically.
### The Six-Step Framework
**Step 1: Define the problem precisely**
Before touching a single tool, answer:
- What is the exact error message?
- What is the source (client IP/pod/service)?
- What is the destination (IP/hostname/port/protocol)?
- When did it start? What changed?
- Is it affecting all requests or some?
```bash
# Gather basic context
hostname && ip addr show
date && uptime
last reboot
```
**Step 2: Verify basic connectivity (Layer 3)**
```bash
# Can you reach the host at all?
ping -c 4 <target-ip>
# What's the path?
traceroute <target-ip>
```
If `ping` fails to an IP: routing or firewall issue. If `ping` succeeds but everything else fails: higher-layer problem.
**Step 3: Verify DNS resolution**
```bash
dig <hostname>
dig @8.8.8.8 <hostname> # Bypass local resolver
nslookup <hostname>
```
If the hostname doesn't resolve, or resolves to the wrong IP, you've found your problem.
**Step 4: Check routing**
```bash
ip route get <target-ip> # Shows which route will be used
ip route show # Full routing table
```
**Step 5: Check firewall and port accessibility**
```bash
nc -zv <target-ip> <port> # Is the port reachable?
telnet <target-ip> <port>
ss -tulnp # Is the service listening locally?
iptables -L -n -v # Any rules blocking traffic?
```
**Step 6: Check the application layer**
```bash
curl -v http://<target>:<port>/health
curl -vvv --resolve <host>:<port>:<ip> https://<host>/path
journalctl -u <service> --since "10 min ago"
kubectl logs <pod> --previous
```
### Decision Tree
```
Is ping to target IP working?
├── NO → Routing/firewall issue
│ ├── Check: ip route get <ip>
│ ├── Check: iptables -L
│ └── Check: Cloud security groups
└── YES → Is DNS resolving correctly?
├── NO → DNS issue
│ ├── Check: /etc/resolv.conf
│ ├── Check: dig @<nameserver> <host>
│ └── K8s: check CoreDNS pods
└── YES → Is the port open?
├── NO → Service not running or firewall blocking
│ ├── Check: ss -tulnp
│ └── Check: nc -zv host port
└── YES → Application layer issue
├── Check: curl -v
├── Check: app logs
└── Check: TLS certificate
```
---
## 4. Essential Linux Network Troubleshooting Tools
### ping — Baseline Connectivity
**What it does:** Sends ICMP echo requests to verify Layer 3 reachability and measure round-trip time.
**When to use:** First check in any troubleshooting session.
```bash
ping -c 4 8.8.8.8 # 4 packets to Google DNS
ping -c 4 google.com # Tests both DNS and connectivity
ping -I eth0 10.0.0.5 # Force specific interface
ping -s 1400 10.0.0.5 # Test with larger packet (MTU issues)
```
**Interpreting output:**
```
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=118 time=12.3 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=118 time=11.8 ms
--- 8.8.8.8 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 11.8/12.0/12.3/0.2 ms
```
- `ttl` varying between replies = the return path is changing (route flapping or per-packet load balancing)
- High `mdev` (variance) = network instability
- 100% packet loss to an IP that should be reachable = firewall dropping ICMP or host down
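A related trick for the MTU case: set the don't-fragment bit so oversized probes fail instead of being silently fragmented. The payload size is the MTU minus 28 bytes of IPv4 and ICMP headers; the helper below is illustrative:

```shell
# ICMP echo payload that exactly fills a given IPv4 MTU:
# payload = MTU - 20 (IP header) - 8 (ICMP header)
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Probe a 1500-byte path with fragmentation disallowed. If this fails
# while a smaller -s succeeds, a hop has a lower MTU (common with
# VPNs and overlay networks).
# ping -M do -c 3 -s "$(icmp_payload_for_mtu 1500)" 10.0.0.5
```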
---
### traceroute / tracepath — Path Analysis
**What it does:** Shows every network hop between you and the destination. Identifies where packets stop.
```bash
traceroute 8.8.8.8
traceroute -T -p 443 api.example.com # TCP-based traceroute (bypasses ICMP blocks)
tracepath 8.8.8.8 # Similar, no root required
mtr --report 8.8.8.8 # Real-time, combines ping+traceroute
```
**Production use:** If traceroute shows packets reaching hop 7 but never hop 8, the problem is between those two nodes — which might be a cloud router, firewall, or misconfigured VPN gateway.
---
### ip — Interface and Route Management
**What it does:** The modern replacement for `ifconfig` and `route`. Manages interfaces, addresses, routes, and more.
```bash
ip addr show # All interfaces and IPs
ip addr show eth0 # Specific interface
ip route show # Routing table
ip route get 10.0.1.5 # Which route would be used?
ip link show # Interface state (UP/DOWN)
ip neigh show # ARP table
```
**Real-world example:**
```bash
$ ip route get 10.96.0.1
10.96.0.1 via 192.168.1.1 dev eth0 src 192.168.1.50 uid 0
cache
```
This tells you: traffic to `10.96.0.1` (Kubernetes ClusterIP) goes via gateway `192.168.1.1` on `eth0`. If you expect it to go through a Kubernetes CNI interface instead, something is misconfigured.
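When you script this check, the field worth extracting is `dev`. A small awk sketch over `ip route get` output (fed the sample line above so it runs anywhere; `egress_dev` is a made-up name):

```shell
# Print the token that follows "dev" in `ip route get` output.
egress_dev() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "dev") print $(i + 1) }'
}

# Live usage:  ip route get 10.96.0.1 | egress_dev
echo "10.96.0.1 via 192.168.1.1 dev eth0 src 192.168.1.50 uid 0" | egress_dev
# -> eth0
```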
---
### ss — Socket Statistics
**What it does:** Shows open sockets, listening ports, and established connections. Faster and more powerful than `netstat`.
```bash
ss -tulnp # TCP+UDP, listening only, with process names
ss -tnp state established # All established TCP connections
ss -s # Summary statistics
ss -tulnp | grep :443 # Who is listening on 443?
ss -tnp dst 10.0.0.5 # Connections to specific destination
```
**Interpreting output:**
```
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp LISTEN 0 128 0.0.0.0:80 0.0.0.0:* users:(("nginx",pid=1234,fd=6))
tcp ESTAB 0 0 10.0.0.10:54312 10.0.0.5:443 users:(("curl",pid=5678,fd=5))
```
- `Recv-Q` nonzero on LISTEN: app is not accepting connections fast enough (backlog full)
- `Send-Q` nonzero on ESTAB: network congestion, destination not consuming data
---
### curl — HTTP/HTTPS Testing
**What it does:** The Swiss Army knife for testing HTTP endpoints.
```bash
curl -v http://service:8080/health # Verbose output
curl -I https://example.com # Headers only
curl -w "\nTime: %{time_total}s\n" https://example.com # Timing
curl --connect-timeout 5 --max-time 10 http://slow-service/
curl -k https://self-signed.example.com # Skip TLS verification
curl --resolve api.example.com:443:10.0.0.5 https://api.example.com # Override DNS
curl -H "Host: myapp.example.com" http://10.0.0.5/ # Test ingress with custom Host header
```
**The timing breakdown is gold for diagnosing slow requests:**
```bash
curl -w "DNS: %{time_namelookup}s | Connect: %{time_connect}s | TLS: %{time_appconnect}s | Total: %{time_total}s\n" \
-o /dev/null -s https://example.com
```
---
### dig — DNS Interrogation
**What it does:** Queries DNS servers directly. The primary tool for DNS troubleshooting.
```bash
dig google.com # A record
dig google.com AAAA # IPv6 record
dig google.com MX # Mail records
dig @8.8.8.8 google.com # Query specific nameserver
dig +short google.com # IP only
dig +trace google.com # Full resolution chain
dig -x 8.8.8.8 # Reverse DNS
dig @10.96.0.10 kubernetes.default.svc.cluster.local # K8s CoreDNS
```
**Reading dig output:**
```
;; ANSWER SECTION:
google.com. 299 IN A 142.250.80.46
;; Query time: 12 msec
;; SERVER: 8.8.8.8#53
```
- `299` = TTL in seconds (low TTL = DNS changes propagate quickly)
- `SERVER` = which resolver actually answered
- No ANSWER section = DNS record doesn't exist or resolution failed
---
### tcpdump — Packet Capture
**What it does:** Captures and analyzes raw network packets. The ultimate source of truth.
```bash
tcpdump -i eth0 # All traffic on eth0
tcpdump -i any port 443 # All TLS traffic, any interface
tcpdump host 10.0.0.5 # Traffic to/from specific host
tcpdump src 10.0.0.5 and dst port 8080 # Filtered
tcpdump -w /tmp/capture.pcap -i eth0 # Save to file (analyze in Wireshark)
tcpdump -i eth0 -nn -v port 53 # DNS queries, no hostname resolution
tcpdump 'tcp[tcpflags] & (tcp-syn) != 0' # SYN packets only
```
**Reading a TCP handshake:**
```
14:23:01 IP 10.0.0.10.54312 > 10.0.0.5.443: Flags [S], seq 1234567890
14:23:01 IP 10.0.0.5.443 > 10.0.0.10.54312: Flags [S.], seq 9876543210, ack 1234567891
14:23:01 IP 10.0.0.10.54312 > 10.0.0.5.443: Flags [.], ack 9876543211
```
- `[S]` = SYN (initiating connection)
- `[S.]` = SYN-ACK (server acknowledging)
- `[.]` = ACK (connection established)
- `[R]` = RST (connection refused/reset — port closed or firewall rejecting)
- `[F]` = FIN (graceful close)
If you see SYN packets leaving but no SYN-ACK arriving, a firewall is dropping packets.
---
### nc (netcat) — Swiss Army Knife for TCP/UDP
**What it does:** Opens raw TCP/UDP connections, useful for port testing and basic service simulation.
```bash
nc -zv 10.0.0.5 443 # Test if port is open (verbose)
nc -zv 10.0.0.5 8080-8090 # Scan port range
nc -l 8080 # Listen on port 8080 (simple server)
echo "GET / HTTP/1.0" | nc 10.0.0.5 80 # Raw HTTP request
nc -u 10.0.0.5 514 # UDP test (syslog port)
```
---
### nmap — Network Scanner
**What it does:** Scans ports and services across one or many hosts.
```bash
nmap -p 443,8080,8443 10.0.0.5 # Scan specific ports
nmap -p- 10.0.0.5 # All 65535 ports
nmap -sV 10.0.0.5 # Version detection
nmap -sn 10.0.0.0/24 # Host discovery (no port scan)
nmap --script ssl-cert 10.0.0.5 -p 443 # Check TLS certificate
```
**Note:** Use with permission. In production environments, coordinate scans to avoid triggering security alerts.
---
## 5. Troubleshooting DNS Problems
DNS issues cause a disproportionate number of production incidents, and they're often subtle — everything looks fine until a TTL expires or a pod restarts.
### How DNS Resolution Works on Linux
```
Application → glibc resolver
→ Check /etc/nsswitch.conf (order: files, dns)
→ Check /etc/hosts (files)
→ Query /etc/resolv.conf nameserver(s)
└── systemd-resolved (127.0.0.53) on modern systems
└── Upstream DNS (8.8.8.8, or corporate DNS)
```
```bash
cat /etc/resolv.conf
# nameserver 127.0.0.53 <- systemd-resolved stub
# nameserver 10.96.0.10 <- K8s CoreDNS (inside pods)
# search default.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5
```
### Common DNS Issues and How to Debug Them
**Issue: DNS timeout**
```bash
# How long does resolution take?
time dig google.com
# Is the resolver responding?
dig @127.0.0.53 google.com
dig @8.8.8.8 google.com # Bypass local resolver
# Check systemd-resolved status
systemd-resolve --status   # older systems
resolvectl status          # current name (systemd 239+)
```
**Issue: Wrong DNS server configured**
```bash
# See what resolver is being used
cat /etc/resolv.conf
# On systemd-resolved systems
resolvectl dns
# Override for a single query
dig @10.0.0.1 internal.service.corp
```
**Issue: DNS works for public names but not internal**
```bash
# Check search domains
grep search /etc/resolv.conf
# Manually test internal name with full FQDN
dig internal.service.corp.
dig internal.service.corp # Note trailing dot matters
```
### Kubernetes CoreDNS
Inside Kubernetes pods, DNS is handled by CoreDNS at the `kube-dns` ClusterIP (typically `10.96.0.10`).
```bash
# Verify CoreDNS pods are running
kubectl -n kube-system get pods -l k8s-app=kube-dns
# Test DNS from inside a pod
kubectl run -it --rm debug --image=busybox --restart=Never -- sh
# Inside pod:
nslookup kubernetes.default
nslookup myservice.mynamespace.svc.cluster.local
cat /etc/resolv.conf
# Check CoreDNS logs
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50
# Check CoreDNS ConfigMap
kubectl -n kube-system get configmap coredns -o yaml
```
**The `ndots:5` setting explained:** In Kubernetes, any name with fewer than five dots (and no trailing dot) is tried against every `search` domain before being queried as-is. So `myservice` expands to `myservice.default.svc.cluster.local`, then `myservice.svc.cluster.local`, and so on. Even external names like `api.example.com` go through the whole search list first, which multiplies DNS queries and can cause timeouts. Consider using FQDNs with a trailing dot for external endpoints.
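The resolver's decision reduces to counting dots, which makes it easy to sanity-check a name before chasing phantom DNS failures. A sketch of the documented glibc behavior (the function name is made up):

```shell
# Will glibc try the search list before querying the name as-is?
# ndots rule: fewer than <ndots> dots and no trailing dot => search first.
search_first() {
  local name="$1" ndots="${2:-5}" dots
  case "$name" in
    *.) echo "absolute: queried as-is (trailing dot)"; return ;;
  esac
  dots="${name//[^.]/}"            # keep only the dots, then count them
  if [ "${#dots}" -lt "$ndots" ]; then
    echo "search-first: ${#dots} dot(s) < ndots=$ndots"
  else
    echo "as-is first: ${#dots} dot(s) >= ndots=$ndots"
  fi
}

search_first myservice          # search-first
search_first api.example.com    # search-first (only 2 dots!)
search_first api.example.com.   # absolute
```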
---
## 6. Troubleshooting Connectivity Issues
### Localhost Issues
```bash
# Is the service bound to the right interface?
ss -tulnp | grep <port>
# Service bound to 127.0.0.1 won't be reachable externally
# Service bound to 0.0.0.0 listens on all interfaces
```
If a service is bound to `127.0.0.1:8080` and you're trying to reach it from another host — that's your problem. Check the application configuration to bind to `0.0.0.0` or the specific external IP.
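The bind scope is scriptable from the Local Address column of `ss -tln`. A classifier sketch (fed literal addresses here so it runs anywhere; the function name is illustrative):

```shell
# Classify a listening socket's bind scope from its local address.
bind_scope() {
  case "$1" in
    127.*|'[::1]'*)       echo "loopback-only: unreachable from other hosts" ;;
    0.0.0.0:*|'[::]:'*)   echo "all interfaces: externally reachable" ;;
    *)                    echo "specific interface: $1" ;;
  esac
}

# Live usage: classify every listener on the host.
# ss -tln | awk 'NR > 1 {print $4}' | while read -r a; do bind_scope "$a"; done
bind_scope "127.0.0.1:8080"   # loopback-only
bind_scope "0.0.0.0:8080"     # all interfaces
```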
### Server-to-Server Connectivity
```bash
# From source server, test destination
ping <destination-ip>
nc -zv <destination-ip> <port>
curl -v http://<destination-ip>:<port>/health
# Check routing
ip route get <destination-ip>
# Example output
$ ip route get 10.0.1.50
10.0.1.50 via 10.0.0.1 dev eth0 src 10.0.0.10 uid 0
```
**Reading the routing table:**
```bash
$ ip route show
default via 192.168.1.1 dev eth0 proto dhcp
10.0.0.0/8 via 10.10.0.1 dev vpn0 # Internal traffic via VPN
172.16.0.0/12 via 10.10.0.1 dev vpn0
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.50
```
Traffic to `10.0.1.50` matches the `/8` route and goes via the VPN. If that VPN tunnel is down, connection fails even though the host is physically reachable.
### Container Networking Issues
In Docker/Kubernetes, containers have their own network namespace with separate interfaces and routes.
```bash
# Docker: inspect container network
docker inspect <container> | grep -i network
docker exec <container> ip addr
docker exec <container> ip route
docker exec <container> cat /etc/resolv.conf
# Check Docker bridge network
ip link show docker0
bridge link show
```
---
## 7. Troubleshooting Ports and Services
### Is the Service Listening?
```bash
ss -tulnp # All listening sockets with process
ss -tulnp | grep :8080 # Specific port
ss -tulnp | grep nginx # Specific process
# If ss not available (old systems)
netstat -tulnp
netstat -tulnp | grep LISTEN
```
### Is the Port Reachable Remotely?
```bash
nc -zv 10.0.0.5 8080 # Quick port test
nc -zv -w 3 10.0.0.5 8080 # 3 second timeout
# Test from within Kubernetes pod
kubectl exec -it <pod> -- nc -zv <service-name> <port>
kubectl exec -it <pod> -- wget -qO- http://<service>:<port>/health
```
**Interpreting nc results:**
```
Connection to 10.0.0.5 8080 port [tcp/http-alt] succeeded! # Port open
nc: connect to 10.0.0.5 port 8080 (tcp) failed: Connection refused # Port closed
# (hangs/timeout) = firewall dropping packets silently
```
The difference between "Connection refused" and a timeout is critical:
- **Refused** = host is reachable, but nothing is listening on that port (or iptables REJECT)
- **Timeout** = packets are being dropped (firewall DROP rule, routing issue, host unreachable)
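Bash's built-in `/dev/tcp` lets you automate this distinction even on hosts without `nc`: an immediate failure means something actively refused the connection (RST came back), while exhausting the timeout means silent drops. A sketch (the function name is made up):

```shell
# Distinguish refused (RST received) from filtered (silent drop).
probe_port() {
  local host="$1" port="$2" t="${3:-3}"
  if timeout "$t" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "open"
  elif [ "$?" -eq 124 ]; then   # timeout(1) exits 124 when it kills the probe
    echo "filtered: no response in ${t}s (firewall DROP or routing)"
  else
    echo "refused: host answered with RST (nothing listening, or REJECT)"
  fi
}

probe_port 127.0.0.1 1   # a closed loopback port is refused instantly
```

The same timing logic explains `nc` behavior: `Connection refused` comes back in milliseconds, while a DROP rule eats the entire timeout.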
---
## 8. Firewall Troubleshooting
### iptables
```bash
# View all rules with packet counts
iptables -L -n -v
# View NAT table (important for K8s/Docker)
iptables -t nat -L -n -v
# View filter table explicitly
iptables -t filter -L INPUT -n -v --line-numbers
# Check if a specific port is blocked
iptables -L INPUT -n | grep DROP
iptables -L INPUT -n | grep REJECT
```
**Understanding chains:**
- **INPUT** — traffic destined for this host
- **OUTPUT** — traffic originating from this host
- **FORWARD** — traffic passing through this host (relevant for routers, K8s nodes)
- **PREROUTING** (nat table) — DNAT happens here (e.g., K8s service VIP → pod IP)
- **POSTROUTING** (nat table) — SNAT/masquerade happens here
**Real-world K8s iptables example:**
When you access a Kubernetes ClusterIP service, iptables intercepts the traffic and rewrites the destination to a backend pod IP using DNAT rules created by kube-proxy:
```bash
# See K8s service rules
iptables -t nat -L KUBE-SERVICES -n -v | grep 10.96.0.10
iptables -t nat -L KUBE-SVC-<hash> -n -v
# Trace a packet through iptables (kernel module)
modprobe xt_LOG
iptables -t raw -I PREROUTING -p tcp --dport 8080 -j LOG --log-prefix "PKT: "
# Watch: dmesg | grep PKT
# Clean up: iptables -t raw -D PREROUTING -p tcp --dport 8080 -j LOG --log-prefix "PKT: "
```
### nftables
nftables replaces iptables on modern distributions (RHEL 8+, Debian 10+):
```bash
nft list ruleset # All rules
nft list table inet filter # Filter table
nft list chain inet filter input # Input chain
```
### ufw (Ubuntu Firewall)
```bash
ufw status verbose # Current rules and status
ufw allow 8080/tcp # Allow port
ufw deny from 10.0.0.5 # Block source IP
ufw logging on # Enable logging (/var/log/ufw.log)
```
---
## 9. Packet-Level Troubleshooting with tcpdump
tcpdump is your ground truth. When you can't trust what the application says, packets don't lie.
### Confirming Traffic Reaches the Server
```bash
# On the server, capture incoming connections
tcpdump -i any -nn port 8080
# On the server, filter for specific client
tcpdump -i any -nn src 10.0.0.10 and port 8080
# Capture and save for later analysis
tcpdump -i eth0 -w /tmp/debug.pcap port 443
# Transfer to your laptop and open in Wireshark
```
### Diagnosing Connection Failures
**Scenario: Client sends SYN, gets no response**
```bash
# On client
tcpdump -i eth0 host 10.0.0.5 and port 8080
# See: SYN packets going out, nothing coming back
# Conclusion: Firewall is dropping packets (DROP rule, security group)
# On server
tcpdump -i eth0 port 8080
# If SYN packets don't appear here: firewall before server
# If SYN packets appear but no SYN-ACK: server-side issue (app not listening, server firewall)
```
**Scenario: Connection established but no data**
```bash
tcpdump -i any -nn -A host 10.0.0.5 and port 8080
# -A prints ASCII content
# Look for HTTP request/response or lack thereof
```
**Advanced tcpdump filters:**
```bash
# Only SYN packets (new connections)
tcpdump 'tcp[tcpflags] & tcp-syn != 0'
# Only RST packets (connection resets)
tcpdump 'tcp[tcpflags] & tcp-rst != 0'
# HTTP GET requests
tcpdump -A -s 0 'tcp dst port 80 and tcp[((tcp[12:1] & 0xf0) >> 2):4] = 0x47455420'
# Large packets (MTU debugging)
tcpdump 'ip[2:2] > 1400'
# DNS queries
tcpdump -i any -nn port 53
```
---
## 10. Kubernetes Network Troubleshooting
Kubernetes networking has multiple layers: pod networking, service networking, and ingress. Each can fail independently.
### Pod Networking Fundamentals
Every pod gets its own IP (managed by CNI: Calico, Flannel, Cilium, etc.). Pods can communicate directly via IP across nodes — if CNI is working correctly.
```bash
# Get pod IPs
kubectl get pods -o wide -n <namespace>
# Check which node a pod is on
kubectl get pod <pod> -o wide
# Test connectivity from inside a pod
kubectl exec -it <pod> -- ping <other-pod-ip>
kubectl exec -it <pod> -- nc -zv <service-name> <port>
kubectl exec -it <pod> -- wget -qO- http://<service>:<port>/health
```
### Service Networking
Kubernetes Services create virtual IPs (ClusterIP) and route traffic to matching pods via iptables/IPVS rules set up by kube-proxy.
```bash
# Inspect a service
kubectl get svc <service-name> -o wide
kubectl describe svc <service-name>
# Verify endpoints exist (if Endpoints is empty, no pods match the selector)
kubectl get endpoints <service-name>
kubectl describe endpoints <service-name>
```
**Empty Endpoints is the #1 cause of "Service unreachable" in Kubernetes.** This means no pods match the service selector. Check:
```bash
# What selector does the service use?
kubectl get svc <service> -o jsonpath='{.spec.selector}'
# Do any pods match?
kubectl get pods -l app=myapp # Replace with your selector labels
kubectl get pods --show-labels | grep <label>
```
### Debugging with Ephemeral Containers
```bash
# Run a debug pod in same namespace
kubectl run debug-pod --image=nicolaka/netshoot -it --rm --restart=Never -- bash
# Inside netshoot: dig, curl, tcpdump, iperf all available
# Attach to running pod's network namespace (K8s 1.23+)
kubectl debug -it <pod> --image=nicolaka/netshoot --target=<container>
```
### Ingress Troubleshooting
```bash
# Check ingress configuration
kubectl get ingress -A
kubectl describe ingress <name>
# Check ingress controller pods
kubectl -n ingress-nginx get pods
kubectl -n ingress-nginx logs <ingress-pod> --tail=50
# Test with explicit Host header
curl -H "Host: myapp.example.com" http://<ingress-controller-ip>/path
# Test TLS
openssl s_client -connect myapp.example.com:443 -servername myapp.example.com
```
### CNI Issues
If pods on different nodes can't communicate:
```bash
# Check CNI pods (Calico example)
kubectl -n kube-system get pods -l k8s-app=calico-node
# Check node-level routes
ip route show | grep <pod-cidr>
# Verify CNI interface exists
ip link show cali* # Calico
ip link show flannel* # Flannel
ip link show cilium* # Cilium
# Check for CNI config
ls /etc/cni/net.d/
cat /etc/cni/net.d/10-calico.conflist
```
---
## 11. Cloud Network Troubleshooting
### AWS
**Security Groups** are the most common source of connectivity problems in AWS. They are stateful: allowing inbound traffic automatically allows the return traffic, so you rarely need matching outbound rules.
```bash
# From the AWS CLI
aws ec2 describe-security-groups --group-ids sg-xxxxxxxx
aws ec2 describe-network-acls --filters Name=vpc-id,Values=vpc-xxxxxxxx
# Check effective security groups on an instance
aws ec2 describe-instances --instance-ids i-xxxxxxxx \
--query 'Reservations[].Instances[].SecurityGroups'
```
**Common AWS network gotchas:**
- Security Group allows port 8080, but the **application is binding to 127.0.0.1** — packets arrive, but the kernel resets them because nothing is listening on the external interface
- **NACLs are stateless** — you need both inbound AND outbound rules (unlike Security Groups)
- **VPC Peering** is not transitive — A peers with B, B peers with C ≠ A can reach C
- **Route tables** — subnets need explicit routes to reach peered VPCs, VPN gateways, etc.
```bash
# Test from EC2 instance
curl http://169.254.169.254/latest/meta-data/ # Instance metadata (needs a session token if IMDSv2 is enforced)
curl http://169.254.169.254/latest/meta-data/local-ipv4
# VPC Flow Logs — enable on suspect subnets, then query CloudWatch Logs
# Look for REJECT action on expected traffic
```
### GCP
```bash
# Check firewall rules
gcloud compute firewall-rules list
gcloud compute firewall-rules describe <rule-name>
# Check routes
gcloud compute routes list
# VPC network details
gcloud compute networks describe <network-name>
```
**GCP-specific gotchas:** Firewall rules apply to the entire VPC network, not subnets. Target tags or service accounts control which VMs the rule applies to.
### Azure
In Azure, **Network Security Groups (NSGs)** can be attached at both the subnet level and the NIC level — both are evaluated. A common mistake is configuring the NIC NSG but forgetting the subnet NSG, or vice versa.
```bash
az network nsg show -g <resource-group> -n <nsg-name>
az network nsg rule list -g <resource-group> --nsg-name <nsg-name>
az network nic show -g <resource-group> -n <nic-name>
```
---
## 12. Real Production Incident Walkthrough
### Scenario: "Payment Service Unreachable After Deployment"
**Alert received:** `payment-service` health check failing. 0% success rate for 5 minutes.
**Step 1: Define the problem**
```bash
kubectl get pods -n payments
# NAME READY STATUS RESTARTS AGE
# payment-svc-7d9f8b6-xk2pq 0/1 Running 0 3m
# payment-svc-7d9f8b6-mn8qt 0/1 Running 0 3m
```
Pods are running but not READY. Something is failing the readiness probe.
**Step 2: Check events and logs**
```bash
kubectl describe pod payment-svc-7d9f8b6-xk2pq -n payments
# Events:
# Warning Unhealthy 2m kubelet Readiness probe failed: Get "http://10.0.1.45:8080/health": dial tcp 10.0.1.45:8080: connect: connection refused
kubectl logs payment-svc-7d9f8b6-xk2pq -n payments --tail=30
# Error: Cannot connect to database: dial tcp 10.96.45.12:5432: i/o timeout
```
App is running but can't reach its database.
**Step 3: DNS and service check**
```bash
kubectl exec -it payment-svc-7d9f8b6-xk2pq -n payments -- sh
# Inside pod:
nslookup postgres-service.databases.svc.cluster.local
# Server: 10.96.0.10
# Non-authoritative answer: Name: postgres-service.databases.svc.cluster.local
# Address: 10.96.45.12
```
DNS resolves correctly.
**Step 4: Test connectivity**
```bash
# Still inside pod
nc -zv 10.96.45.12 5432
# (hangs — timeout, not refused)
```
Port times out. Either the service has no endpoints, or a NetworkPolicy is blocking it.
**Step 5: Check endpoints**
```bash
kubectl get endpoints postgres-service -n databases
# NAME ENDPOINTS AGE
# postgres-service <none> 45m
```
**No endpoints!** The service has no backing pods.
**Step 6: Find the root cause**
```bash
kubectl get pods -n databases
# NAME READY STATUS RESTARTS AGE
# postgres-0 0/1 ImagePullBackOff 0 46m
```
The database pod failed to start due to `ImagePullBackOff`. During the deployment, someone updated the database image tag in the Helm values and pushed an image that doesn't exist in the registry.
**Resolution:**
```bash
# Fix the image tag
helm upgrade postgres ./charts/postgres -n databases --set image.tag=15.3
# Verify pod comes up
kubectl get pods -n databases -w
# Verify endpoints populate
kubectl get endpoints postgres-service -n databases
# NAME ENDPOINTS AGE
# postgres-service 10.0.1.82:5432 2m
# Verify payment service recovers
kubectl get pods -n payments
```
**Total resolution time: 11 minutes.** The structured approach — checking events, logs, DNS, connectivity, endpoints in sequence — avoided hours of guessing.
---
## 13. Advanced Troubleshooting Techniques
### conntrack — Connection Tracking
The Linux connection tracking table records all NAT'd connections. Useful for debugging K8s service routing and SNAT issues.
```bash
conntrack -L # List all tracked connections
conntrack -L | grep 10.0.0.5 # Filter by IP
conntrack -L | wc -l # Total tracked connections
# If this is near nf_conntrack_max, you'll drop connections
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
# Watch new connections in real-time
conntrack -E -e NEW
```
**High conntrack count is a real production issue.** Under heavy load the table can fill up, and once it does the kernel drops new connections with nothing in application logs to explain it (look for `nf_conntrack: table full, dropping packet` in `dmesg`).
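A utilization check against those /proc counters is worth wiring into node monitoring. A sketch (the 80% threshold is an arbitrary illustrative choice; the arithmetic is split out so it can be tested anywhere):

```shell
# Percentage of the conntrack table in use.
conntrack_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Read live counters and warn when the table is nearly full.
check_conntrack() {
  local count max pct
  count=$(cat /proc/sys/net/netfilter/nf_conntrack_count 2>/dev/null) || {
    echo "conntrack not loaded on this host"; return 0; }
  max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
  pct=$(conntrack_pct "$count" "$max")
  echo "conntrack: $count/$max (${pct}%)"
  if [ "$pct" -ge 80 ]; then
    echo "WARNING: near capacity; raise net.netfilter.nf_conntrack_max via sysctl"
  fi
}
```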
### Network Namespaces
Containers and pods have isolated network namespaces. To troubleshoot at the packet level inside a container without installing tools in the container:
```bash
# Find the container PID
docker inspect <container> | grep Pid
# Or for K8s
crictl inspect <container-id> | grep pid
# Enter the network namespace
nsenter -t <pid> -n -- ip addr show
nsenter -t <pid> -n -- ss -tulnp
nsenter -t <pid> -n -- tcpdump -i any port 8080
```
### Advanced ss Filters
```bash
# Show only connections in TIME_WAIT (can indicate connection storm)
ss -tn state time-wait | wc -l
# Show sockets by memory usage (find memory hog)
ss -tm
# Connections to a specific destination port
ss -tn dst :443
# Filter by source address
ss -tn src 10.0.0.5
```
### strace for Socket Debugging
When you need to know exactly what syscalls an application makes:
```bash
strace -e trace=network -p <pid>
strace -e connect,bind,sendto,recvfrom curl http://example.com
```
This shows every `connect()` call, which IP:port the app is trying to reach, and what errors it receives — invaluable when the app logs are ambiguous.
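The errno on the `connect()` line is the payoff, and it maps onto the same refused/timeout distinction as before. A grep-style classifier sketch (the sample line below is fabricated in strace's output format; the function name is made up):

```shell
# Classify a strace connect() line by its errno.
classify_connect() {
  case "$1" in
    *ECONNREFUSED*) echo "refused: host reachable, nothing listening on that port" ;;
    *ETIMEDOUT*)    echo "timeout: packets dropped (firewall DROP or routing)" ;;
    *EHOSTUNREACH*) echo "no route to host: check ip route and the gateway" ;;
    *EINPROGRESS*)  echo "non-blocking connect in flight: not an error yet" ;;
    *'= 0'*)        echo "connected" ;;
    *)              echo "unclassified: $1" ;;
  esac
}

classify_connect 'connect(5, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("10.96.45.12")}, 16) = -1 ETIMEDOUT (Connection timed out)'
# -> timeout: packets dropped (firewall DROP or routing)
```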
---
## 14. Automation and Monitoring
The best network troubleshooting is the one you don't have to do because your monitoring caught the issue first.
### Key Metrics to Monitor with Prometheus
```yaml
# Key network metrics to alert on:
# Blackbox exporter — probe availability
probe_success{job="blackbox", instance="https://api.example.com"} == 0
# Node exporter — interface errors
rate(node_network_receive_errs_total[5m]) > 0
rate(node_network_transmit_errs_total[5m]) > 0
# DNS resolution failures (CoreDNS)
rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m]) > 0.01
# Kubernetes endpoint availability
kube_endpoint_address_available{endpoint="my-service"} == 0
# TCP retransmits (sign of network congestion)
rate(node_netstat_Tcp_RetransSegs[5m]) > 10
```
### Grafana Dashboards
Key dashboards to maintain:
- **Node Exporter Full** (dashboard ID 1860) — network interface metrics per node
- **Kubernetes Networking** — pod/service network traffic
- **CoreDNS** — DNS query rates, SERVFAIL rates, response times
- **Blackbox Exporter** — endpoint availability and probe duration
### Proactive Alerting
```yaml
# AlertManager rule example
groups:
- name: network
rules:
- alert: ServiceEndpointDown
expr: kube_endpoint_address_available == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Kubernetes service {{ $labels.endpoint }} has no available endpoints"
- alert: DNSHighLatency
expr: histogram_quantile(0.99, rate(coredns_dns_request_duration_seconds_bucket[5m])) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "CoreDNS p99 latency > 500ms"
```
### Continuous Connectivity Testing
Run synthetic monitoring probes from within your cluster:
```bash
# Deploy a simple network probe pod that tests connectivity continuously
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: network-probe
spec:
replicas: 1
selector:
matchLabels:
app: network-probe
template:
metadata:
labels:
app: network-probe
spec:
containers:
- name: probe
image: nicolaka/netshoot
command: ["/bin/sh", "-c"]
args:
- while true; do
nc -zv postgres-service.databases 5432 && echo "DB OK" || echo "DB FAIL";
sleep 10;
done
EOF
kubectl logs -f deployment/network-probe
```
---
## 15. Best Practices Checklist
**Investigation practices:**
- Always start with `ping` before anything else — establish whether basic connectivity exists
- Always check DNS separately from connectivity — they fail independently
- Always run diagnostic commands from both ends (source and destination) when possible
- Save tcpdump captures (`-w file.pcap`) before the issue clears itself
- Document your debugging steps — you'll face this issue again
- Check "what changed" in your deployment pipeline before spending time on tools
**Infrastructure practices:**
- Implement health checks and readiness probes on all Kubernetes workloads
- Always set resource limits — a pod consuming all CPU can cause DNS timeouts that look like network issues
- Use NetworkPolicies in Kubernetes but test them in audit mode first
- Keep firewall rules documented and in version control (Terraform, Pulumi)
- Enable VPC Flow Logs in cloud environments — they're invaluable after the fact
- Set up Blackbox Exporter probes for all critical service endpoints
- Monitor CoreDNS health metrics actively
**Security practices:**
- Default-deny NetworkPolicies in Kubernetes namespaces, then explicitly allow
- Put security group/NACL changes through change management — they take effect silently and can cause immediate outages
- Regularly audit firewall rules for stale entries
**Operational practices:**
- Maintain a network diagram — knowing expected topology cuts debug time in half
- Keep `netshoot` or similar debug images available in your container registry
- Create runbooks for known failure patterns (DNS failures, endpoint empty, etc.)
- Add network-layer metrics to your SLOs — don't just track application error rates
---
## 16. Conclusion
Network troubleshooting is a skill that compounds over time. The engineer who's debugged a hundred incidents builds a mental model that shortcuts the diagnostic process — they know where to look first because they've seen the patterns.
**The core mental model:** Connectivity is a chain. Every link in that chain (DNS, routing, firewall, application) must work for the end result to work. Your job is to find the broken link, and the fastest way to do that is to test each link systematically rather than randomly.
**Key principles to internalize:**
- Packets don't lie. When in doubt, `tcpdump` at both ends.
- DNS is nearly always involved. Test it early, test it explicitly.
- "Connection timeout" and "Connection refused" mean different things — read the error carefully.
- Empty Kubernetes endpoints cause more service outages than any other single issue.
- The most recent change is usually the cause. Check your deployment history before spending 30 minutes with tools.
**Master these tools first:** `ping`, `dig`, `ss`, `curl -v`, `nc`, `tcpdump`. With these six, you can resolve 90% of production network issues. The rest — `conntrack`, `nsenter`, `strace` — are for the 10% of deep-dive investigations.
Network troubleshooting is not magic. It's methodology, layered knowledge, and the right tools applied in the right order. Build that foundation, and production outages become problems to solve rather than fires to fight.
---
_Vladimiras Levinas is a Lead DevOps Engineer with 18+ years in fintech infrastructure. He runs a production K3s homelab and writes about AI infrastructure at doc.thedevops.dev_