Portal | Level: L1: Foundations | Topics: Docker / Containers, Container Runtimes | Domain: Kubernetes
Scenario: Docker Container Won't Start in Production¶
The Prompt¶
"We push a new Docker image to production and the container keeps crashing. It works fine in the dev environment. How do you troubleshoot this?"
Initial Report¶
CI/CD notification: "Deployment to production failed. Container `api` exited with code 1 after 3 restart attempts. Rollback triggered automatically. The same image passes all tests in staging."
Constraints¶
- Time pressure: Automatic rollback saved production, but the team needs this fix deployed today.
- Environment difference: Dev runs Docker Compose locally, staging is a smaller K8s cluster, production is EKS with strict security policies.
Observable Evidence¶
- Exit code: 1 (application error)
- Logs: `Error: EACCES: permission denied, open '/app/data/cache.json'`
- Image diff: Same image tag, same SHA in all environments
- Staging vs prod: Staging pods run as root; production enforces `runAsNonRoot: true`
Expected Investigation Path¶
```bash
# 1. Check the container logs
kubectl logs deploy/api --previous -n prod

# 2. Check the security context
kubectl get deploy api -n prod -o yaml | grep -A10 securityContext

# 3. Check what user the container runs as
docker inspect api:v2.5.0 --format='User: {{.Config.User}}'
# Or in K8s:
kubectl exec deploy/api -- id

# 4. Check filesystem permissions in the image
docker run --rm --entrypoint="" api:v2.5.0 ls -la /app/data/

# 5. Compare staging vs prod pod specs
kubectl get deploy api -n staging -o yaml > /tmp/staging.yaml
kubectl get deploy api -n prod -o yaml > /tmp/prod.yaml
diff /tmp/staging.yaml /tmp/prod.yaml
```
Strong Answer¶
"The key insight is that the same image behaves differently across environments. The error `EACCES: permission denied` combined with production enforcing `runAsNonRoot` tells me this is a file-permissions issue.
In dev and staging, the container runs as root, so it can write anywhere. In production with `runAsNonRoot: true` and `readOnlyRootFilesystem: true`, the non-root user can't write to `/app/data/cache.json`.
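As an illustrative sketch of the production posture that triggers the failure (the UID/GID values and deployment shape here are assumptions, not taken from the incident):

```yaml
# Illustrative hardened production spec: these are the settings that turn
# a root-writable /app/data in staging into an EACCES in prod.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true   # refuse to start containers as UID 0
        runAsUser: 10001     # illustrative non-root UID
        fsGroup: 10001       # volumes get this GID plus group-write
      containers:
        - name: api
          image: api:v2.5.0
          securityContext:
            readOnlyRootFilesystem: true  # image layers stay read-only
```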
I'd fix this in layers:
- Immediate fix: Add an `emptyDir` volume mounted at `/app/data/` so the container has a writable directory, and ensure the `fsGroup` in the pod security context matches the app user's GID.
- Better fix: Update the Dockerfile to create the data directory with correct ownership.
- Best fix: Make the app configurable: write cache to `/tmp` or a volume mount, not a hardcoded path. This makes the image work regardless of security context.
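The Dockerfile fix above could be sketched like this (the `node` base image, UID/GID 10001, and the start command are illustrative assumptions, not details from the incident):

```dockerfile
# Sketch: create a dedicated non-root user at build time and give it
# ownership of the cache directory, so runAsNonRoot works unchanged.
FROM node:20-alpine
WORKDIR /app
COPY . .
# UID/GID 10001 is an assumption; match it to the pod's runAsUser/fsGroup
RUN addgroup -g 10001 app && adduser -D -u 10001 -G app app \
    && mkdir -p /app/data \
    && chown -R app:app /app/data
USER 10001
CMD ["node", "server.js"]
```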
I'd also add a policy or CI check that tests images with the production security context in staging, so this class of issue is caught before it reaches prod."
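One way that CI check could look is a pipeline step that runs the candidate image under production-like constraints before promotion; this is a hedged sketch, with the image tag and UID as placeholders:

```bash
#!/usr/bin/env sh
# CI gate sketch: approximate runAsNonRoot + readOnlyRootFilesystem with
# docker run flags, so EACCES-class bugs fail the pipeline, not prod.
set -eu

IMAGE="api:v2.5.0"   # placeholder: substitute the candidate image tag

docker run --rm \
  --user 10001:10001 \
  --read-only \
  --tmpfs /tmp \
  "$IMAGE" \
  sh -c 'echo smoke-test > /app/data/cache.json'
```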
Red Flags (Weak Answers)¶
- Suggesting to disable `runAsNonRoot` in production
- Not recognizing the environment difference as the root cause
- Only looking at application code, not the runtime environment
- Not understanding how `securityContext`, `fsGroup`, and file permissions interact
- Not suggesting a preventive measure
Follow-ups¶
- "What's the difference between `runAsUser` and `fsGroup`?"
- "What if the container uses `readOnlyRootFilesystem: true`? How do you handle temp files?"
- "How would you ensure staging matches production's security settings?"
- "The container also needs to write to `/tmp`. What's the best approach?"
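The `runAsUser` vs `fsGroup` follow-up comes down to the standard Unix permission check. A simplified sketch (it deliberately ignores root's override, ACLs, and capabilities):

```python
import stat

def can_write(mode: int, file_uid: int, file_gid: int,
              proc_uid: int, proc_gids: set[int]) -> bool:
    """Simplified Unix write check.

    runAsUser sets the process UID (owner-bit path); fsGroup adds a
    supplementary GID on volumes (group-bit path). Either can grant write.
    """
    if proc_uid == file_uid:
        return bool(mode & stat.S_IWUSR)   # owner write bit
    if file_gid in proc_gids:
        return bool(mode & stat.S_IWGRP)   # group write bit
    return bool(mode & stat.S_IWOTH)       # other write bit

# /app/data owned by root:root, mode 0755: UID 10001 cannot write
assert not can_write(0o755, 0, 0, 10001, {10001})

# With fsGroup the kubelet sets the volume's group to 10001 and adds
# group-write (0775), so the same non-root process can now write
assert can_write(0o775, 0, 10001, 10001, {10001})
```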
Key Concepts Tested¶
- Container security model: Non-root, read-only filesystem, capabilities
- Environment parity: Dev/staging/prod consistency
- Debugging methodology: Logs → config comparison → root cause
- Docker best practices: USER instruction, multi-stage builds, file ownership
- Kubernetes security context:
runAsNonRoot,readOnlyRootFilesystem,fsGroup - Defense in depth: Don't weaken security to fix bugs
Wiki Navigation¶
Related Content¶
- Containers Deep Dive (Topic Pack, L1) — Container Runtimes, Docker / Containers
- Deep Dive: Containers How They Really Work (deep_dive, L2) — Container Runtimes, Docker / Containers
- AWS ECS (Topic Pack, L2) — Docker / Containers
- Case Study: CI Pipeline Fails — Docker Layer Cache Corruption (Case Study, L2) — Docker / Containers
- Case Study: Container Vuln Scanner False Positive Blocks Deploy (Case Study, L2) — Docker / Containers
- Case Study: ImagePullBackOff Registry Auth (Case Study, L1) — Docker / Containers
- Container Images (Topic Pack, L1) — Docker / Containers
- Container Runtime Drills (Drill, L2) — Container Runtimes
- Container Runtime Flashcards (CLI) (flashcard_deck, L1) — Container Runtimes
- Deep Dive: Docker Image Internals (deep_dive, L2) — Docker / Containers