Portal | Level: L1: Foundations | Topics: Docker / Containers, Container Runtimes | Domain: Kubernetes

Scenario: Docker Container Won't Start in Production

The Prompt

"We pushed a new Docker image to production and the container keeps crashing. It works fine in the dev environment. How do you troubleshoot this?"

Initial Report

CI/CD notification: "Deployment to production failed. Container api exited with code 1 after 3 restart attempts. Rollback triggered automatically. The same image passes all tests in staging."

Constraints

  • Time pressure: Automatic rollback saved production, but the team needs this fix deployed today.
  • Environment difference: Dev runs Docker Compose locally, staging is a smaller K8s cluster, production is EKS with strict security policies.

Observable Evidence

  • Exit code: 1 (application error)
  • Logs: Error: EACCES: permission denied, open '/app/data/cache.json'
  • Image diff: Same image tag, same SHA in all environments
  • Staging vs prod: Staging pods run as root, production enforces runAsNonRoot: true
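
The last evidence point is the pivotal one. As a minimal sketch, the production pod spec enforces something like the following (runAsNonRoot is confirmed by the evidence; the runAsUser value is an assumption for illustration):

```yaml
# Hedged sketch of production's enforced pod security context.
securityContext:
  runAsNonRoot: true   # confirmed: production rejects root containers
  runAsUser: 1000      # assumed UID; the actual value may differ
```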

Expected Investigation Path

# 1. Check the container logs
kubectl logs deploy/api --previous -n prod

# 2. Check the security context
kubectl get deploy api -n prod -o yaml | grep -A10 securityContext

# 3. Check what user the container runs as
docker inspect api:v2.5.0 --format='User: {{.Config.User}}'
# Or in K8s:
kubectl exec deploy/api -- id

# 4. Check filesystem permissions in the image
docker run --rm --entrypoint="" api:v2.5.0 ls -la /app/data/

# 5. Reproduce the production constraints locally (UID 1000 is assumed)
docker run --rm --user 1000:1000 --read-only api:v2.5.0

# 6. Compare staging vs prod pod specs
kubectl get deploy api -n staging -o yaml > /tmp/staging.yaml
kubectl get deploy api -n prod -o yaml > /tmp/prod.yaml
diff /tmp/staging.yaml /tmp/prod.yaml

Strong Answer

"The key insight is that the same image behaves differently across environments. The error EACCES: permission denied combined with production enforcing runAsNonRoot tells me this is a file permissions issue.

In dev and staging, the container runs as root, so it can write anywhere. In production, runAsNonRoot: true means the process runs as an unprivileged user that doesn't own /app/data, so the write to /app/data/cache.json fails with EACCES; if readOnlyRootFilesystem: true is also enforced, no path inside the image is writable at all.

I'd fix this in layers:

  1. Immediate fix: Add an emptyDir volume mounted at /app/data/ so the container has a writable directory, and set fsGroup in the pod security context so Kubernetes chowns the volume to a group the app process belongs to (the fsGroup GID is also added to the process as a supplemental group).

  2. Better fix: Update the Dockerfile to create the data directory with correct ownership:

    RUN mkdir -p /app/data && chown 1000:1000 /app/data
    USER 1000
    

  3. Best fix: Make the app configurable — write cache to /tmp or a volume mount, not a hardcoded path. This makes the image work regardless of security context.

I'd also add a policy or CI check that tests images with the production security context in staging, so this class of issue is caught before it reaches prod."
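
The immediate fix above can be sketched as a deployment fragment. This is illustrative, not the team's actual manifest — the volume name, UID/GID values, and image tag are assumptions:

```yaml
# Sketch: give the non-root process a writable /app/data via emptyDir.
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000     # assumed app UID
    fsGroup: 1000       # volume is group-owned by this GID at mount time
  containers:
    - name: api
      image: api:v2.5.0
      volumeMounts:
        - name: data
          mountPath: /app/data
  volumes:
    - name: data
      emptyDir: {}
```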

Red Flags (Weak Answers)

  • Suggesting to disable runAsNonRoot in production
  • Not recognizing the environment difference as the root cause
  • Only looking at application code, not the runtime environment
  • Not understanding how securityContext, fsGroup, and file permissions interact
  • Not suggesting a preventive measure
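
On the securityContext/fsGroup interaction in particular, a hedged sketch of how the three fields differ (the UID/GID values are illustrative only):

```yaml
securityContext:
  runAsUser: 1000    # UID the container process runs as
  runAsGroup: 3000   # primary GID of the process
  fsGroup: 2000      # supported volumes are made group-owned by GID 2000,
                     # and the process gets 2000 as a supplemental group
```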

Follow-ups

  1. "What's the difference between runAsUser and fsGroup?"
  2. "What if the image uses readOnlyRootFilesystem: true? How do you handle temp files?"
  3. "How would you ensure staging matches production's security settings?"
  4. "The container also needs to write to /tmp — what's the best approach?"

Key Concepts Tested

  • Container security model: Non-root, read-only filesystem, capabilities
  • Environment parity: Dev/staging/prod consistency
  • Debugging methodology: Logs → config comparison → root cause
  • Docker best practices: USER instruction, multi-stage builds, file ownership
  • Kubernetes security context: runAsNonRoot, readOnlyRootFilesystem, fsGroup
  • Defense in depth: Don't weaken security to fix bugs
