Portal | Level: L1: Foundations | Topics: Docker / Containers, Container Runtimes | Domain: Kubernetes
Scenario: Docker Container Won't Start in Production¶
The Prompt¶
"We push a new Docker image to production and the container keeps crashing. It works fine in the dev environment. How do you troubleshoot this?"
Initial Report¶
CI/CD notification: "Deployment to production failed. Container `api` exited with code 1 after 3 restart attempts. Rollback triggered automatically. The same image passes all tests in staging."
Constraints¶
- Time pressure: Automatic rollback saved production, but the team needs this fix deployed today.
- Environment difference: Dev runs Docker Compose locally, staging is a smaller K8s cluster, production is EKS with strict security policies.
Observable Evidence¶
- Exit code: 1 (application error)
- Logs: `Error: EACCES: permission denied, open '/app/data/cache.json'`
- Image diff: Same image tag, same SHA in all environments
- Staging vs prod: Staging pods run as root; production enforces `runAsNonRoot: true`
Expected Investigation Path¶
```bash
# 1. Check the container logs
kubectl logs deploy/api --previous -n prod

# 2. Check the security context
kubectl get deploy api -n prod -o yaml | grep -A10 securityContext

# 3. Check what user the container runs as
docker inspect api:v2.5.0 --format='User: {{.Config.User}}'
# Or in K8s:
kubectl exec deploy/api -- id

# 4. Check filesystem permissions in the image
docker run --rm --entrypoint="" api:v2.5.0 ls -la /app/data/

# 5. Compare staging vs prod pod specs
kubectl get deploy api -n staging -o yaml > /tmp/staging.yaml
kubectl get deploy api -n prod -o yaml > /tmp/prod.yaml
diff /tmp/staging.yaml /tmp/prod.yaml
```
Strong Answer¶
"The key insight is that the same image behaves differently across environments. The error `EACCES: permission denied` combined with production enforcing `runAsNonRoot` tells me this is a file-permissions issue.
In dev and staging, the container runs as root, so it can write anywhere. In production with `runAsNonRoot: true` and `readOnlyRootFilesystem: true`, the non-root user can't write to `/app/data/cache.json`.
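As an illustrative sketch of the production posture that triggers the failure (the UID/GID values and deployment shape here are assumptions, not taken from the incident):

```yaml
# Illustrative hardened production spec: these are the settings that turn
# a root-writable /app/data in staging into an EACCES in prod.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true   # refuse to start containers as UID 0
        runAsUser: 10001     # illustrative non-root UID
        fsGroup: 10001       # volumes get this GID plus group-write
      containers:
        - name: api
          image: api:v2.5.0
          securityContext:
            readOnlyRootFilesystem: true  # image layers stay read-only
```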
I'd fix this in layers:
- Immediate fix: Add an `emptyDir` volume mounted at `/app/data/` so the container has a writable directory, and ensure the `fsGroup` in the pod security context matches the app user's GID.
- Better fix: Update the Dockerfile to create the data directory with correct ownership.
- Best fix: Make the app configurable: write cache to `/tmp` or a volume mount, not a hardcoded path. This makes the image work regardless of security context.
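The Dockerfile fix above could be sketched like this (the `node` base image, UID/GID 10001, and the start command are illustrative assumptions, not details from the incident):

```dockerfile
# Sketch: create a dedicated non-root user at build time and give it
# ownership of the cache directory, so runAsNonRoot works unchanged.
FROM node:20-alpine
WORKDIR /app
COPY . .
# UID/GID 10001 is an assumption; match it to the pod's runAsUser/fsGroup
RUN addgroup -g 10001 app && adduser -D -u 10001 -G app app \
    && mkdir -p /app/data \
    && chown -R app:app /app/data
USER 10001
CMD ["node", "server.js"]
```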
I'd also add a policy or CI check that tests images with the production security context in staging, so this class of issue is caught before it reaches prod."
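One way that CI check could look is a pipeline step that runs the candidate image under production-like constraints before promotion; this is a hedged sketch, with the image tag and UID as placeholders:

```bash
#!/usr/bin/env sh
# CI gate sketch: approximate runAsNonRoot + readOnlyRootFilesystem with
# docker run flags, so EACCES-class bugs fail the pipeline, not prod.
set -eu

IMAGE="api:v2.5.0"   # placeholder: substitute the candidate image tag

docker run --rm \
  --user 10001:10001 \
  --read-only \
  --tmpfs /tmp \
  "$IMAGE" \
  sh -c 'echo smoke-test > /app/data/cache.json'
```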
Red Flags (Weak Answers)¶
- Suggesting to disable `runAsNonRoot` in production
- Not recognizing the environment difference as the root cause
- Only looking at application code, not the runtime environment
- Not understanding how `securityContext`, `fsGroup`, and file permissions interact
- Not suggesting a preventive measure
Follow-ups¶
- "What's the difference between `runAsUser` and `fsGroup`?"
- "What if the container uses `readOnlyRootFilesystem: true`? How do you handle temp files?"
- "How would you ensure staging matches production's security settings?"
- "The container also needs to write to `/tmp`. What's the best approach?"
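The `runAsUser` vs `fsGroup` follow-up comes down to the standard Unix permission check. A simplified sketch (it deliberately ignores root's override, ACLs, and capabilities):

```python
import stat

def can_write(mode: int, file_uid: int, file_gid: int,
              proc_uid: int, proc_gids: set[int]) -> bool:
    """Simplified Unix write check.

    runAsUser sets the process UID (owner-bit path); fsGroup adds a
    supplementary GID on volumes (group-bit path). Either can grant write.
    """
    if proc_uid == file_uid:
        return bool(mode & stat.S_IWUSR)   # owner write bit
    if file_gid in proc_gids:
        return bool(mode & stat.S_IWGRP)   # group write bit
    return bool(mode & stat.S_IWOTH)       # other write bit

# /app/data owned by root:root, mode 0755: UID 10001 cannot write
assert not can_write(0o755, 0, 0, 10001, {10001})

# With fsGroup the kubelet sets the volume's group to 10001 and adds
# group-write (0775), so the same non-root process can now write
assert can_write(0o775, 0, 10001, 10001, {10001})
```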
Key Concepts Tested¶
- Container security model: Non-root, read-only filesystem, capabilities
- Environment parity: Dev/staging/prod consistency
- Debugging methodology: Logs → config comparison → root cause
- Docker best practices: USER instruction, multi-stage builds, file ownership
- Kubernetes security context:
runAsNonRoot,readOnlyRootFilesystem,fsGroup - Defense in depth: Don't weaken security to fix bugs
Wiki Navigation¶
Related Content¶
- Containers Deep Dive (Topic Pack, L1) — Container Runtimes, Docker / Containers
- Deep Dive: Containers How They Really Work (deep_dive, L2) — Container Runtimes, Docker / Containers
- AWS ECS (Topic Pack, L2) — Docker / Containers
- Case Study: CI Pipeline Fails — Docker Layer Cache Corruption (Case Study, L2) — Docker / Containers
- Case Study: Container Vuln Scanner False Positive Blocks Deploy (Case Study, L2) — Docker / Containers
- Case Study: ImagePullBackOff Registry Auth (Case Study, L1) — Docker / Containers
- Container Images (Topic Pack, L1) — Docker / Containers
- Container Runtime Drills (Drill, L2) — Container Runtimes
- Container Runtime Flashcards (CLI) (flashcard_deck, L1) — Container Runtimes
- Deep Dive: Docker Image Internals (deep_dive, L2) — Docker / Containers