
Thinking Out Loud: Docker

A senior SRE's internal monologue while working through a real Docker issue. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

A CI pipeline is failing with "no space left on device" during docker build. The build server has a 200GB disk that should be plenty. Builds were working fine last week.

The Monologue

"No space left on device" on a 200GB disk for Docker builds. This is almost always Docker's disk usage growing without cleanup. Let me check the actual disk state.

df -h /
df -h /var/lib/docker

Root partition: 98% used. /var/lib/docker is 165GB. There it is. Docker is consuming most of the disk. Let me see what's taking the space.
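This kind of check is worth automating on build servers. A minimal sketch that extracts the use percentage for a mount and flags it past a threshold — the df output is a hard-coded sample matching this incident, not live output:

```shell
# Extract the Use% column for a mount and flag it past a threshold.
# Sample df output from this incident; pipe in real `df -h /` in practice.
sample='Filesystem Size Used Avail Use% Mounted on
/dev/sda1 200G 196G 4.2G 98% /'
usage=$(printf '%s\n' "$sample" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }')
if [ "$usage" -ge 90 ]; then
  echo "disk usage at ${usage}% - investigate before builds start failing"
fi
```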

docker system df

Images: 89GB (87 images), Containers: 2.3GB (42 containers, 38 stopped), Build Cache: 61GB, Volumes: 12GB. The build cache at 61GB is the biggest surprise. But 89GB of images is also a lot — we probably have dozens of old image versions hanging around.
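A quick sanity check that those four buckets actually account for the disk: summing the reported sizes should land near the 165GB under /var/lib/docker.

```shell
# Sum the four buckets reported by `docker system df` (sample numbers
# from this incident: images 89, containers 2.3, build cache 61, volumes 12).
total=$(printf '89\n2.3\n61\n12\n' | awk '{ s += $1 } END { printf "%.1f", s }')
echo "${total}GB accounted for"   # 164.3GB, close to the 165GB observed
```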

Let me check for stopped containers first — those are the easiest win.

docker ps -a --filter status=exited --format "table {{.ID}}\t{{.Image}}\t{{.Status}}\t{{.Size}}" | head -20

38 stopped containers from old CI runs. Nobody cleaned them up. Let me check for dangling images too.

docker images --filter dangling=true -q | wc -l
docker images --format "{{.CreatedAt}}\t{{.Repository}}:{{.Tag}}\t{{.Size}}" | sort | head -20

32 dangling images (no tag, orphaned layers from builds) and a bunch of tagged images from months ago.

Mental Model: Docker Disk Layers

Docker disk usage has four layers: containers (running + stopped), images (used + dangling + unreferenced), build cache (BuildKit layer cache), and volumes (named + anonymous). Each layer has different cleanup commands. docker system prune is the nuclear option — it cleans all stopped containers, dangling images, unused networks, and optionally build cache. But it doesn't touch used images or named volumes. For a CI server, aggressive pruning is safe because nothing needs to persist.
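The mental model above maps each layer to its own targeted prune command. A small sketch — the command strings are standard Docker CLI, but nothing here executes them:

```shell
# Map each of the four disk layers to its targeted cleanup command.
# Emits strings only -- safe to run without touching the daemon.
prune_cmd() {
  case "$1" in
    containers)  echo "docker container prune -f" ;;
    images)      echo "docker image prune -af" ;;
    build-cache) echo "docker builder prune -af" ;;
    volumes)     echo "docker volume prune -f" ;;
    *)           echo "unknown layer: $1" >&2; return 1 ;;
  esac
}
prune_cmd build-cache   # docker builder prune -af
```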

On a CI server, I can be aggressive with cleanup. Nothing on this server is production state — everything should be rebuildable.

# Clean up stopped containers, dangling images, unused networks, and build cache
docker system prune -af --volumes

Wait — --volumes will also remove named volumes. Let me check if there are any volumes that need to persist.

docker volume ls --format "{{.Name}}\t{{.Driver}}" | head -10

All anonymous volumes (random hex names) from old container runs. Nothing named that needs to persist. Safe to prune with --volumes.
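That "random hex names" check can be made mechanical: anonymous volumes get 64-character hex IDs as names, so any name that doesn't match the pattern deserves a second look before pruning. A sketch with made-up volume names:

```shell
# Anonymous volumes are named with a 64-char hex ID; named volumes aren't.
is_anonymous() {
  printf '%s' "$1" | grep -Eq '^[0-9a-f]{64}$'
}
is_anonymous "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08" \
  && echo "anonymous: safe to prune"
is_anonymous "postgres-data" \
  || echo "named: check before pruning"
```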

docker system prune -af --volumes

Reclaimed: 158GB. Let me verify.

df -h /var/lib/docker
docker system df

Down to 7GB. Perfect. But this will happen again if we don't set up automated cleanup.

Mental Model: CI Servers Need Aggressive Automated Cleanup

On CI/CD build servers, Docker artifacts are transient by definition. Images, containers, and build cache accumulate without natural cleanup because CI runs create them but never remove them. Every CI server should have either: (1) a cron job that prunes Docker artifacts older than N days, (2) a post-build cleanup step in the CI pipeline, or (3) ephemeral build workers that are destroyed after each run.

Let me set up a daily cleanup cron.

cat <<'EOF' | sudo tee /etc/cron.d/docker-cleanup
# Clean stopped containers and unused images older than 24 hours, daily at 3 AM.
# (Not `docker system prune -a` -- its until filter would also delete build
# cache on the 24h schedule, defeating the longer cache window below.)
0 3 * * * root docker container prune -f --filter "until=24h" >> /var/log/docker-cleanup.log 2>&1
10 3 * * * root docker image prune -af --filter "until=24h" >> /var/log/docker-cleanup.log 2>&1
# Clean build cache older than 7 days
20 3 * * * root docker builder prune -af --filter "until=168h" >> /var/log/docker-cleanup.log 2>&1
EOF

The --filter "until=24h" keeps artifacts from the last 24 hours while cleaning everything older. The build-cache window is more generous (7 days) because a warm cache keeps CI builds fast.
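The until filter takes Go-style durations, which top out at hours, so day counts have to be expressed as hours. A trivial helper to avoid the mental math (the function name is mine):

```shell
# Convert a day count to the hours-based `until` filter value Docker expects
# (Go durations have no day unit, so 7 days must be written as 168h).
days_to_until() {
  echo "until=$(( $1 * 24 ))h"
}
days_to_until 7    # until=168h
days_to_until 1    # until=24h
```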

Now, let me also look at why the build cache is so big. 61GB of build cache suggests either no multi-stage builds (so intermediate layers are huge) or the Dockerfiles aren't optimizing layer caching.

docker buildx du --verbose 2>/dev/null | head -20

Large cache entries from... COPY . . steps. That's the classic Docker layer caching mistake. Every COPY . . creates a new cache entry for the entire source tree. If the source tree is large (100MB+) and you build 50 times a day, that's 5GB per day of build cache just from source copies.
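That growth estimate is simple arithmetic; making it explicit, with the same numbers assumed above:

```shell
# Back-of-envelope: daily build-cache growth from `COPY . .` layers.
src_mb=100         # size of the copied source tree, in MB
builds_per_day=50  # CI build frequency
echo "$(( src_mb * builds_per_day / 1000 ))GB of cache per day"   # 5GB of cache per day
```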

Let me check one of the project Dockerfiles.

cat /ci/projects/api-service/Dockerfile
FROM python:3.11
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["python", "main.py"]

Classic beginner Dockerfile. Problems:

  1. COPY . . before pip install — every code change invalidates the pip install cache
  2. No multi-stage build — the final image has build tools
  3. Running as root (no USER directive)
  4. Using full python:3.11 instead of slim

If I fix the layer ordering, the build cache becomes much more effective:

FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=builder /app .
RUN useradd -r -s /usr/sbin/nologin appuser
USER appuser
CMD ["python", "main.py"]

Key changes: COPY requirements.txt first (dependencies change rarely, so this layer is cached). Multi-stage build (the final image doesn't have pip or build tools). Run as non-root user. Use slim base image.
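A toy model of why the ordering works, under the simplification that a layer's cache key hashes its instruction plus the content it depends on (the helper names are mine, not Docker internals):

```shell
# Toy model: a layer is rebuilt only when its cache key changes, and the
# key depends on the instruction plus the content it copies or consumes.
layer_key() { printf '%s' "$*" | sha256sum | cut -d' ' -f1; }

reqs=$(layer_key 'requirements.txt: flask==3.0')
pip_before=$(layer_key "RUN pip install" "$reqs")

# A source edit changes the `COPY . .` layer, but requirements.txt is
# untouched, so the pip layer's key (and its cache entry) is unchanged.
pip_after=$(layer_key "RUN pip install" "$reqs")
[ "$pip_before" = "$pip_after" ] && echo "pip install layer: cache hit"
```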

But that's a code change for the dev team. For the infrastructure problem (disk filling up), the cron job will handle it. Let me also check what storage limits the Docker daemon has configured, as a safety net.

cat /etc/docker/daemon.json
{
  "storage-driver": "overlay2",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

Log rotation is set (good), but there's no cap on total storage, and overlay2 doesn't offer one (per-container size quotas require an xfs backing filesystem with pquota enabled). The practical safety net here is capping BuildKit's build cache. Since CI runs plain docker build against the default builder, that's configured in daemon.json, not in a buildkitd.toml (a ~/.docker/buildx/buildkitd.default.toml only applies to buildx builders created with the docker-container driver).

sudo tee /etc/docker/daemon.json <<'EOF'
{
  "storage-driver": "overlay2",
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  },
  "builder": {
    "gc": {
      "enabled": true,
      "defaultKeepStorage": "20GB"
    }
  }
}
EOF
sudo systemctl restart docker

Now BuildKit garbage-collects the build cache automatically whenever it grows past 20GB, on top of the daily cron pruning.

Let me verify the build server is healthy and CI can run again.

docker build -t test-build /ci/projects/api-service/ --no-cache
echo $?

Build succeeded. CI is unblocked.

What Made This Senior-Level

  1. Junior: add a bigger disk. Senior: clean up Docker artifacts and set up automated cleanup. Why: more disk just delays the same problem; Docker will fill any disk without cleanup.
  2. Junior: run docker system prune manually this one time. Senior: set up a cron job AND configure BuildKit GC limits for permanent cleanup. Why: manual cleanup is a one-time fix; automated cleanup prevents recurrence.
  3. Junior: not investigate WHY the build cache was so large. Senior: look at the Dockerfiles and identify layer-caching anti-patterns. Why: fixing the layer ordering makes the build cache more effective AND smaller.
  4. Junior: not set log rotation for Docker containers. Senior: verify (and set) the json-file log driver's max-size to prevent log-based disk fill. Why: container logs are another unbounded growth vector that fills disks.

Key Heuristics Used

  1. Docker Disk Has Four Layers: Containers, images, build cache, and volumes each need different cleanup strategies. docker system df shows the breakdown.
  2. CI Servers Need Automated Cleanup: Build artifacts are transient. Set up daily pruning and BuildKit GC limits to prevent accumulation.
  3. Dockerfile Layer Ordering Matters: COPY dependencies before source code. This makes layer caching effective and reduces build cache size.

Cross-References

  • Primer — Docker image layers, build cache, and storage driver mechanics
  • Street Ops — Docker cleanup commands, system df, and build optimization
  • Footguns — No Docker cleanup on CI servers, COPY-before-dependencies anti-pattern, and running containers as root