> *A senior engineer's field guide to Docker observability, production debugging, and the commands that actually matter when things break.*

---

## The Incident That Changed How I Use Docker

It was a Tuesday afternoon. A payment API started returning 504s. The container was running — `docker ps` confirmed it. Logs looked clean at a glance. CPU metrics on the host seemed fine. The on-call engineer spent forty minutes refreshing dashboards, restarting the container twice, and escalating to the backend team before anyone thought to run `docker stats`.

The container was using 98% of its memory limit. It had been leaking memory for six hours. A restart fixed it temporarily — until the leak filled the new allocation twelve minutes later.

The problem was not Docker. The problem was that nobody ran the right command first.

Most engineers who work with Docker daily use four commands: `docker build`, `docker run`, `docker ps`, and `docker logs`. These are the surface. Under that surface is a diagnostic toolkit that most engineers discover only after an incident forces them to — usually at 2am, under pressure, with stakeholders asking for updates.

This article is about that toolkit. Ten commands that expose what Docker is actually doing, how it is managing your containers, where resources are going, and what changed since deployment. Followed by seven production scenarios and the exact commands that resolve them fastest.

The difference between guessing and understanding is knowing which command to run first.

---

## The 10 Docker Commands That Actually Matter

---

### 1. `docker logs --since`

Most engineers use `docker logs container-name` and scroll. During an incident, this is like reading a novel from page one to find a sentence on page 847. The `--since` flag scopes logs to a specific time window, which is exactly what incident investigation requires.
```bash
docker logs --since=15m --timestamps api-container
docker logs --since=2024-01-15T14:30:00 --until=2024-01-15T14:45:00 api-container
docker logs --since=1h --tail=200 api-container
```

![[Pasted image 20260315100915.png|1200]]

**Why engineers underestimate it:** Scrolling logs feels like doing something. Time-scoped queries feel like extra effort. In reality, `--since` is the difference between reading 50 relevant lines and scrolling through 50,000.

**Production scenario:** An API starts erroring at 14:32. You run `--since=14:30` and immediately see a database connection timeout that began exactly when a config change was deployed. Without the time scope, that error is buried in six hours of successful request logs.

**Pro tip:** Always combine with `--timestamps`. Docker stores log timestamps separately from log content, and without them you are debugging blind when log lines themselves contain no time reference.

---

### 2. `docker inspect`

`docker inspect` outputs the complete runtime configuration of a container or image as JSON — everything Docker knows about that object. Environment variables, restart policy, mounted volumes, network settings, resource limits, health check configuration, and the actual command being executed.

```bash
docker inspect api-container
docker inspect --format='{{.HostConfig.RestartPolicy.Name}}' api-container
docker inspect --format='{{range .Config.Env}}{{println .}}{{end}}' api-container
docker inspect --format='{{json .Mounts}}' api-container
```

![[Pasted image 20260315101129.png|1200]]

**Why engineers underestimate it:** Most engineers assume what they configured is what Docker is running. This assumption is frequently wrong. Environment variables get overridden, volume mounts fail silently and fall back to container defaults, restart policies differ between environments.

**Production scenario:** A service works in staging and fails in production.
`docker inspect` on both environments reveals that the production container is missing an environment variable that was set in the Compose file but not in the Kubernetes manifest used for production. The application fails silently when the variable is absent rather than raising a configuration error.

**Pro tip:** The `--format` flag with Go template syntax makes `docker inspect` scriptable. Use it in health check scripts, deployment validation steps, and CI pipelines to assert that containers are configured as expected before traffic is routed to them.

---

### 3. `docker exec`

`docker exec` opens a shell or runs a command inside a running container. This is not a deployment tool — it is a diagnostic tool. The operational value is validating what the container actually sees at runtime: which files exist, which environment variables are present, whether network connectivity works from inside the container namespace.

```bash
docker exec -it api-container sh
docker exec api-container env | grep DATABASE
docker exec api-container cat /etc/hosts
docker exec api-container wget -qO- http://internal-service:8080/health
```

**Why engineers underestimate it:** Engineers often debug from the outside, assuming that host-level visibility is equivalent to container-level visibility. It is not. Network resolution, file paths, environment variables, and DNS behavior inside a container namespace can differ significantly from the host.

**Production scenario:** A container cannot reach an internal service. From the host, the service resolves correctly. `docker exec` into the container reveals that the container's `/etc/hosts` lacks the custom entry added to the host, and DNS resolution inside the network namespace uses a different resolver. The fix is a DNS configuration change, not an application fix.

**Pro tip:** Use `docker exec container-name env` as the first diagnostic step for any configuration-related issue.
It shows exactly what the application sees, not what you think you configured.

---

### 4. `docker stats`

`docker stats` provides a live, continuously updating view of resource consumption across all running containers: CPU percentage, memory usage and limit, memory percentage, network I/O, and block I/O. It is the equivalent of `top` for your container fleet.

```bash
docker stats
docker stats --no-stream
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"
docker stats api-container worker-container
```

![[Pasted image 20260315101258.png|1200]]

**Why engineers underestimate it:** Engineers instrument applications with external monitoring systems and forget that Docker provides direct, zero-configuration resource visibility. During incidents, waiting for a monitoring platform to surface a resource issue adds minutes that `docker stats` would resolve in seconds.

**Production scenario:** During a load test, response times degrade after fifteen minutes. `docker stats` running in a terminal beside the load test reveals that the API container's memory usage grows linearly and hits its limit at exactly the point response times collapse. The kernel OOM killer then terminates the process and the container restarts. The application has a memory leak triggered by sustained concurrent load — visible in ten seconds with `docker stats`, invisible in the application logs.

**Pro tip:** `--no-stream` returns a single snapshot rather than a continuous feed. Use this in scripts that need a point-in-time resource report across all containers.

---

### 5. `docker system df`

Docker accumulates disk usage silently and aggressively: pulled images, stopped containers, unused volumes, and build cache all consume space without automatic cleanup. `docker system df` shows exactly how much space each category is using and how much is reclaimable.
```bash
docker system df
docker system df -v
```

![[Pasted image 20260315101352.png|1200]]

**Why engineers underestimate it:** Disk space issues feel like infrastructure problems, not Docker problems. Engineers look at host disk usage, see 95% utilization, and call the infrastructure team. The root cause is 40GB of dangling image layers and 15GB of build cache that Docker accumulated over three months of CI runs.

**Production scenario:** A CI server begins failing builds with "no space left on device" errors. Nobody changed anything. `docker system df` reveals 280GB of accumulated images — every build pulled the latest base image, tagged it, and left the previous version as a dangling layer. The CI pipeline had no cleanup step because nobody knew Docker needed one. Three minutes of pruning recovers 240GB.

**Pro tip:** Run `docker system df` weekly in production environments and set up alerting when Docker's disk usage exceeds a threshold. Disk-full container failures are entirely preventable with proactive monitoring.

---

### 6. `docker container prune` / `docker image prune`

Cleanup commands for Docker's accumulated waste. The important distinction that most engineers miss: `docker system prune -a` deletes everything including images used by stopped containers, which can break environments that depend on locally built images not stored in a registry.

```bash
docker container prune
docker image prune
docker image prune -a --filter "until=72h"
docker volume prune
docker builder prune --keep-storage 5GB
```

**Why engineers underestimate it:** Engineers either never clean up (leading to disk exhaustion) or run `docker system prune -a` indiscriminately (deleting images that should have stayed local and breaking builds). The granular prune commands are safer and more predictable.

**Production scenario:** A CI server runs `docker system prune -a` in a cleanup cron job. The next morning, a build fails because a locally built base image — not pushed to the registry — was deleted.
The build pipeline does not rebuild it automatically because it expects the image to exist. The fix requires understanding exactly what `prune -a` deleted and why.

**Pro tip:** Use `--filter "until=72h"` to remove only images older than 72 hours, preserving recent builds while recovering space from accumulated layers. This is safe for most CI environments.

---

### 7. `docker logs -f`

The `-f` flag follows log output in real time. This enables the classic debugging workflow: one terminal running `docker logs -f container`, a second terminal sending requests or triggering operations. Watching log output react to your actions in real time is often faster than any other debugging method.

```bash
docker logs -f api-container
docker logs -f --since=1m api-container
docker logs -f api-container | grep -i error
docker logs -f api-container 2>&1 | grep -E "ERROR|WARN|Exception"
```

**Why engineers underestimate it:** Engineers often look at historical logs and miss that the issue is intermittent and requires observation under specific conditions. Following logs while reproducing the issue changes the investigation from archaeological to observational.

**Production scenario:** An endpoint occasionally returns 500 errors that do not appear in the application's structured error logging. Following logs while triggering the endpoint manually reveals that a third-party library writes unformatted error output to stderr which the application does not capture or forward. Piping `2>&1` into the grep reveals errors that the monitoring system never sees.

**Pro tip:** Pipe to `grep` for noise reduction, but use `2>&1` first to merge stderr — many applications write critical diagnostics to stderr that log forwarding systems miss.

---

### 8. `docker ps --format`

Standard `docker ps` output is dense and hard to read when running many containers.
The `--format` flag with Go template syntax produces exactly the columns you need, in the order you need them, making multi-container environments scannable in seconds.

```bash
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.RunningFor}}\t{{.Status}}"
docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.CreatedAt}}"
docker ps --filter "status=exited" --format "table {{.Names}}\t{{.Status}}"
```

![[Pasted image 20260315101504.png|1200]]

**Why engineers underestimate it:** Raw `docker ps` output contains every column, many irrelevant during an investigation. Engineers visually parse wide table output and miss status details. Formatted output reduces the cognitive load of scanning container state under pressure.

**Production scenario:** A Kubernetes node has 22 containers running. An incident requires quickly identifying which containers have restarted recently. Raw `docker ps` requires reading across wide rows. A formatted query showing only names, status, and running time makes the two containers with "Restarting" status visible immediately.

**Pro tip:** Save your most-used format strings as shell aliases. A `dps` alias with your preferred format turns a multi-second parsing exercise into an instant status read.

---

### 9. `docker diff`

`docker diff` shows every filesystem change made to a running container since it started: files added (A), changed (C), or deleted (D). This command answers the question that matters most for debugging unexpected container behavior: what has this container done to its filesystem?

```bash
docker diff api-container
docker diff api-container | grep "^A"
docker diff api-container | grep -v "^C /tmp"
```

![[Pasted image 20260315101650.png|1200]]

**Why engineers underestimate it:** Engineers assume containers are immutable because they were built from immutable images.
Containers are not immutable at runtime — applications write logs, create temp files, modify configuration, and sometimes write to paths they should not. `docker diff` makes this visible without entering the container.

**Production scenario:** A container passes security scanning but fails a compliance audit. The auditor asks which files changed at runtime. `docker diff` reveals that the application writes processed data to `/var/lib/app/cache` inside the container rather than to the mounted volume. This data is ephemeral, lost on restart, and constitutes a compliance violation because it is not persisted to the audited storage backend.

**Pro tip:** Use `docker diff` in container validation pipelines to assert that no unexpected filesystem changes occur during a known operation. Any change to paths outside designated writable directories is a finding.

---

### 10. `docker history`

`docker history` shows how a Docker image was built: each layer, the command that created it, and its size. This is the primary tool for understanding why an image is larger than expected and which build step is the culprit.

```bash
docker history api-image:latest
docker history --no-trunc api-image:latest
docker history --format "table {{.CreatedBy}}\t{{.Size}}" api-image:latest
```

![[Pasted image 20260315101952.png|1200]]

**Why engineers underestimate it:** Image size feels like a build concern, not an operations concern. In practice, oversized images slow deployments, consume registry storage, consume node disk space, and increase pull times during scaling events. `docker history` is the diagnostic tool that reveals which Dockerfile instruction is responsible.

**Production scenario:** An image grows from 180MB to 1.4GB after a developer adds a data processing feature. `docker history` reveals a `RUN pip install` layer that is 800MB because it installs every scientific computing library rather than a requirements file limited to production dependencies. A second finding: a `COPY . .` instruction copies the entire repository including test data, adding another 400MB. Both are invisible without `docker history`.

**Pro tip:** Track image layer sizes across builds by saving `docker history` output in CI artifacts. Size regressions become visible at build time rather than at deployment time.

---

## 7 Real Production Debugging Scenarios

---

### Scenario 1: Container Keeps Restarting Every Few Minutes

**Symptoms:** `docker ps` shows a container with restart count increasing. The application appears to start, serve traffic briefly, then restart again. Logs from the last run seem normal.

**Commands that solve it:** Start with `docker logs --since=5m --timestamps` to capture output from the most recent execution window. Look for the last lines before the container stopped — this is where the exit reason lives. Then `docker inspect --format='{{.State.ExitCode}}'` to get the exit code. Exit code 137 means the process was killed with SIGKILL — in practice, almost always the OOM killer. Exit code 1 means an application error. Exit code 143 means SIGTERM was received. Then `docker stats --no-stream` to check the memory usage trajectory.

**Resolution path:** Exit code 137 leads to memory limit investigation. Exit code 1 leads to application log analysis. The restart count and timing often reveal a pattern — immediate restarts suggest startup failure, delayed restarts suggest memory leak or external dependency timeout.

---

### Scenario 2: CI/CD Runner Out of Disk Space

**Symptoms:** Build jobs start failing with "no space left on device." The host disk reports 95% utilization. Recent deployments have not increased artifact size significantly.

**Commands that solve it:** `docker system df` immediately quantifies how much Docker is consuming versus the filesystem total. `docker system df -v` lists every image, container, and volume individually. The culprit is almost always either dangling image layers from builds that never cleaned up, build cache from multi-stage builds, or stopped test containers that were never removed.
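As a sketch, the reclaimable-space check can be scripted against `docker system df` output. The sample listing and the GB-only parsing below are illustrative assumptions, not real output:

```shell
# Sketch: total the reclaimable space reported by `docker system df`.
# In real use, pipe the live command in:
#   docker system df | summarize_reclaimable
summarize_reclaimable() {
  # Skip the header row. The second-to-last field is the reclaimable size:
  # "240.1GB (86%)" splits into "240.1GB" and "(86%)".
  # Assumes GB units throughout, as in the sample below.
  awk 'NR > 1 { gsub(/GB/, "", $(NF-1)); total += $(NF-1) }
       END { printf "reclaimable: %.1fGB\n", total }'
}

# Illustrative sample of `docker system df` output from a neglected CI host.
sample='TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          212       3         278.4GB   240.1GB (86%)
Containers      9         2         1.2GB     1.1GB (91%)
Local Volumes   14        2         6.8GB     5.2GB (76%)
Build Cache     117       0         15.3GB    15.3GB (100%)'

printf '%s\n' "$sample" | summarize_reclaimable
```

A single number like this, compared against the host's free space, tells you in one line whether pruning will actually resolve the incident.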
**Resolution path:** `docker image prune --filter "until=48h"` removes images older than 48 hours. `docker builder prune` clears build cache. `docker container prune` removes stopped containers. Implement these as post-build cleanup steps in the CI pipeline to prevent recurrence.

---

### Scenario 3: API Slows Down During Peak Traffic

**Symptoms:** P99 latency increases under load. The application appears healthy. No error logs are visible.

**Commands that solve it:** `docker stats` running while load is applied reveals whether the container is CPU-throttled or memory-constrained. A container hitting its CPU limit will show 100% of its allocated CPU with no headroom — Docker's cgroup enforcement throttles it. This appears as a latency increase with no error output because the application is simply waiting for CPU time it cannot get.

**Resolution path:** CPU percentage near the limit means the container needs higher CPU limits or the workload needs horizontal scaling. Memory approaching the limit means a leak or under-provisioned memory. `docker inspect --format='{{.HostConfig.CpuQuota}}'` shows the configured CPU quota.

---

### Scenario 4: Environment Variables Behave Differently in Production

**Symptoms:** An application uses correct configuration in staging but incorrect values in production. The Compose file and deployment manifests look identical.

**Commands that solve it:** `docker inspect --format='{{range .Config.Env}}{{println .}}{{end}}'` on both the staging and production containers shows every environment variable the container actually received. Compare outputs. Variables expected from a `.env` file, a secrets manager, or a Compose override file that was not applied to production will be absent or show default values.

**Resolution path:** Missing variables are a deployment pipeline problem. Incorrect values indicate a secrets injection failure or an environment-specific override that was not ported between environments.
`docker inspect` settles the question by making the actual runtime state authoritative over the assumed configuration.

---

### Scenario 5: Container Works Locally, Fails in Staging

**Symptoms:** A developer's local build runs correctly. The same image fails in staging with a file-not-found or permission error.

**Commands that solve it:** `docker exec` into the staging container and verify the filesystem structure matches expectations. `docker diff` shows runtime changes. `docker inspect` compares volume mounts between local and staging — a common cause is that local development uses bind mounts that overlay the container filesystem with the host directory, while staging uses named volumes or no mounts, exposing the actual container filesystem content.

**Resolution path:** Usually a path assumption in the application that is satisfied by a local bind mount but not by the container-only filesystem. Fix the Dockerfile to copy the required files rather than depending on mount presence.

---

### Scenario 6: Docker Image Becomes Huge After a Small Change

**Symptoms:** An image that was 250MB becomes 1.1GB after a developer adds a feature branch. CI artifact storage costs increase. Deployment times increase.

**Commands that solve it:** `docker history api-image:latest` shows every layer and its size. Compare with `docker history` on the previous tag to identify which layer expanded. The offending instruction is usually visible immediately — a `COPY` that copies more than intended, an `apt-get install` without cleanup, or a `pip install` that pulls transitive dependencies not previously included.

**Resolution path:** Fix the Dockerfile to add `--no-install-recommends` to apt installs, clean package caches in the same layer, use `.dockerignore` to exclude large directories from the COPY context, and split multi-stage builds to separate build dependencies from runtime images.
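The tag-to-tag comparison above can be sketched as a diff of saved `docker history` listings. The inline listings here are illustrative samples; real input would come from running `docker history --format '{{.Size}}\t{{.CreatedBy}}'` on each tag:

```shell
# Sketch: surface layers that are new or resized between two image tags.
# In real use, generate the listings first, for example:
#   docker history --format '{{.Size}}\t{{.CreatedBy}}' api-image:previous > old.txt
#   docker history --format '{{.Size}}\t{{.CreatedBy}}' api-image:latest   > new.txt
# The inline listings below are illustrative samples, not real output.
old_layers='5.2MB  COPY requirements.txt .
180MB  RUN pip install -r requirements.txt'
new_layers='5.2MB  COPY requirements.txt .
800MB  RUN pip install -r requirements.txt
400MB  COPY . .'

# comm -13 keeps only lines unique to the second (new) listing:
# any size/instruction pair that appeared or changed size is a candidate.
comm -13 <(printf '%s\n' "$old_layers" | sort) \
         <(printf '%s\n' "$new_layers" | sort)
```

Here the diff immediately isolates the bloated `pip install` layer and the oversized `COPY` layer without reading either listing by eye.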
---

### Scenario 7: Container Silently Modifies Files at Runtime

**Symptoms:** A stateless application appears to accumulate state. Restarts clear the issue temporarily. Compliance audit flags unexpected file writes.

**Commands that solve it:** `docker diff` immediately lists every file the container has written, modified, or deleted since start. Cross-reference against the application's expected write paths. Any write outside of designated volume mount points or `/tmp` is unexpected for a stateless application.

**Resolution path:** Map all write paths from `docker diff` output to application code. Files written to the container filesystem are lost on restart and represent either a caching layer that should be in a volume or a bug where the application writes to the wrong path.

---

## The Hidden Architecture of Docker

Understanding why these commands work requires understanding what Docker is actually doing underneath.

![[docker-architecture.svg|1200]]

**Images and Layers:** A Docker image is a stack of read-only filesystem layers, each representing one instruction in the Dockerfile. When you run `docker history`, you see these layers. The overlay filesystem merges them into a single directory tree visible to the container process.

**Overlay Filesystem:** When a container starts, Docker adds a thin read-write layer on top of the image's read-only layers. This is what `docker diff` reports — every write the container makes goes into this layer. This is also why multiple containers can share the same base image layers without copying them: they each have their own R/W layer, but the underlying read-only layers are shared on disk.

**Volumes:** Volumes bypass the overlay filesystem entirely. Data written to a volume mount goes directly to a host path managed by Docker, surviving container restarts. Data written anywhere else in the container filesystem goes into the R/W layer and disappears on restart.
`docker inspect` shows exactly which paths are volume-mounted versus overlay-filesystem-backed.

**Container Logs:** Docker captures stdout and stderr from container processes and writes them to JSON files on the host under `/var/lib/docker/containers`. `docker logs` reads these files. Log rotation is not enabled by default — containers that write extensively to stdout will fill the disk through log accumulation until rotation is configured.

**Namespaces:** Each container runs in isolated Linux namespaces: PID namespace (the container sees its own process tree), network namespace (its own network interfaces), mount namespace (its own filesystem view), and UTS namespace (its own hostname). `docker exec` enters these namespaces, which is why commands run inside the container see different network, filesystem, and process state than the host.

**cgroups:** Resource limits — CPU, memory, block I/O — are enforced through cgroups. When a container hits its memory limit, the kernel OOM killer terminates the process with exit code 137. When a container hits its CPU quota, its processes are throttled without error — they simply receive CPU time at a reduced rate. `docker stats` reads cgroup accounting data, which is why it shows accurate real-time resource consumption.

---

## Debugging Workflow Used by Experienced Engineers

![[docker-troubleshooting-workflow.svg|1200]]

The workflow above is not a rigid procedure — it is an ordered investigation that progressively narrows the problem space without skipping ahead to assumptions.

**Phase 1 — Detect (Steps 1-3):** Establish the container's state, scope the logs to the incident window, and check live resource consumption. These three steps take under two minutes and eliminate 60% of common causes.

**Phase 2 — Analyze (Steps 4-6):** Inspect runtime configuration, enter the container to verify runtime state, and check filesystem changes. These steps address configuration drift and runtime behavior that is invisible from the outside.
**Phase 3 — Infrastructure (Steps 7-8):** Check disk consumption and image layer structure. These address the systemic issues — disk exhaustion, image bloat — that cause intermittent failures without obvious application-level symptoms.

This sequence reduces Mean Time To Resolution not by being faster at each step, but by running steps in an order that surfaces the most common causes earliest. An engineer who jumps directly to `docker exec` without first checking `docker stats` might spend twenty minutes debugging application logic while a memory limit is causing the container to OOMKill every eight minutes.

---

## 8 Additional Docker Commands Worth Knowing

Beyond the core ten, these commands address specific scenarios that arise in production environments:

**`docker events`** streams real-time Docker daemon events: container starts, stops, kills, OOM events, image pulls, and network operations. Pipe to `grep oom` during a suspected OOM incident to see kills as they happen.

**`docker top container-name`** lists the processes running inside a container from the host's perspective, including PIDs from the host namespace. Useful for attaching external profiling tools to container processes.

**`docker cp`** copies files between containers and the host filesystem without entering the container. Use it to extract application logs, configuration files, or crash dumps from containers without modifying their filesystem state.

**`docker network inspect`** shows the complete configuration of a Docker network: connected containers, their IP addresses, the driver, and gateway configuration. The first diagnostic step for container-to-container connectivity failures.

**`docker system df -v`** provides verbose output listing every image, container, and volume individually with size. Use this to identify specific large objects rather than category totals.
**`docker container ls --size`** adds a SIZE column to container listings showing both the container's R/W layer size and the total image size. Containers with large R/W layers are candidates for `docker diff` investigation.

**`docker image inspect image:tag`** provides image metadata analogous to `docker inspect` for containers: layer digests, environment variables baked into the image, exposed ports, and the entrypoint command.

**`docker buildx imagetools inspect image:tag`** shows multi-platform image manifest data — useful when debugging architecture-specific failures where an image built for amd64 is being pulled on arm64.

---

## Best Practices for Production Docker Usage

**Treat containers as immutable.** Anything that needs to persist — logs, uploads, application data — goes on a volume. Anything written inside the container filesystem is temporary by definition. Design applications with this constraint as a requirement, not an afterthought. `docker diff` should show minimal changes for a correctly designed container.

**Configure log rotation.** Docker's default JSON log driver has no rotation enabled. Set `max-size` and `max-file` limits in the Docker daemon configuration or per-container. Containers that write extensively to stdout will fill host disk through log accumulation without rotation.

**Implement active disk management.** Set up automated cleanup of dangling images, stopped containers, and aged build cache as part of the host maintenance routine. On CI servers, run cleanup after every build job. On production hosts, run weekly with conservative age filters.

**Set resource limits explicitly.** Every production container should have explicit memory and CPU limits. Containers without limits can consume unbounded resources and affect neighboring containers on the same host. Use `docker inspect` in deployment validation to assert that limits are set before traffic routing.
**Monitor the Docker host separately from containers.** Container metrics show what containers see. Host metrics show what Docker is doing to the host: disk I/O from the overlay filesystem, memory pressure from container R/W layers, network throughput from bridge interfaces. Both are necessary for complete visibility.

**Debug from the outside before going inside.** `docker logs`, `docker stats`, and `docker inspect` are non-invasive. `docker exec` modifies the container's state by adding a process. When investigating security incidents or compliance violations, non-invasive diagnostics preserve evidence. Enter the container only when external observation is insufficient.

**Validate container configuration programmatically.** `docker inspect` with `--format` makes container configuration assertions scriptable. Integrate configuration validation into deployment pipelines to catch missing environment variables, incorrect restart policies, and absent volume mounts before they cause production incidents.

---

## Conclusion

Docker is not a black box. It is a well-instrumented system with a diagnostic surface that most engineers leave unexplored until an incident forces them to find it.

The engineers who resolve Docker incidents fastest are not the ones who know the application best. They are the ones who run `docker stats` before they start speculating, `docker inspect` before they assume the configuration is correct, and `docker logs --since` before they scroll through hours of output looking for a signal.

Every production incident involving a containerized application has a trail. The container's exit code tells you how it died. The logs from the incident window tell you what it was doing. The resource stats tell you what it was consuming. The filesystem diff tells you what it changed. The image history tells you what was baked into it.

The diagnostic toolkit is already there. The commands exist in every Docker installation.
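As a sketch, those commands can be wrapped in a small first-response script. The version below is a dry run that prints each diagnostic in order rather than executing it; `api-container` and the time window are placeholder values:

```shell
# Dry-run first-response checklist: prints each diagnostic command in order.
# "api-container" is a placeholder name; swap echo for eval to execute.
triage() {
  container="$1"
  window="${2:-15m}"
  for cmd in \
    "docker ps --filter name=$container" \
    "docker logs --since=$window --timestamps $container" \
    "docker stats --no-stream $container" \
    "docker inspect --format '{{.State.ExitCode}}' $container" \
    "docker exec $container env" \
    "docker system df"
  do
    echo "$cmd"    # replace with: eval "$cmd"
  done
}

triage api-container 15m
```

Printing before executing keeps the checklist honest: the same six commands, in the same order, every incident.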
The discipline is running them in order, before assumptions, before guesses, before escalations.

```
docker ps → docker logs --since → docker stats → docker inspect → docker exec → docker system df
```

Six commands. In order. Before anything else. That sequence has ended more 2am incidents than any amount of familiarity with the application code underneath.

---

*Written from production experience with Docker at scale. All scenarios are based on real incident patterns observed across containerized environments in fintech, SaaS, and platform engineering contexts.*

_Created by Vladimiras Levinas | Lead DevOps Engineer_
_© 2026 | Built with production experience from real-world GitOps automation_
_Article maintained at [doc.thedevops.dev](https://doc.thedevops.dev/) | Last updated: March 2026_