# AWS ECS - Street-Level Ops
Real-world workflows for operating ECS services and debugging production issues.
## Quick Service Health Check

```bash
# Service status — the first thing you look at
aws ecs describe-services --cluster production --services api-service \
  --query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount,status:status}'
# Output:
# {
#     "desired": 4,
#     "running": 4,
#     "pending": 0,
#     "status": "ACTIVE"
# }

# Service events — shows recent scheduling decisions and failures
aws ecs describe-services --cluster production --services api-service \
  --query 'services[0].events[:10].[createdAt,message]' --output table

# Running tasks with IPs and health status
aws ecs list-tasks --cluster production --service-name api-service --desired-status RUNNING \
  --query 'taskArns' --output text | tr '\t' '\n' | \
  xargs -I{} aws ecs describe-tasks --cluster production --tasks {} \
    --query 'tasks[*].{id:taskArn,status:lastStatus,health:healthStatus,ip:containers[0].networkInterfaces[0].privateIpv4Address}'
```
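The desired-vs-running comparison is worth wrapping in a tiny guard for deploy scripts or cron checks. A minimal sketch, assuming the exact JSON shape from the query above (`service_healthy` is a hypothetical helper, not an AWS CLI command):

```bash
# Hypothetical helper: succeeds only when running == desired, given the
# JSON produced by the describe-services query above.
service_healthy() {
  local json="$1" desired running
  desired=$(printf '%s' "$json" | sed -n 's/.*"desired": \([0-9]*\).*/\1/p')
  running=$(printf '%s' "$json" | sed -n 's/.*"running": \([0-9]*\).*/\1/p')
  [ -n "$desired" ] && [ "$desired" = "$running" ]
}

# Usage: service_healthy "$(aws ecs describe-services ...)" || alert
```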
## Debugging Task Placement Failures
When the service events log shows "unable to place a task" repeatedly, work through this checklist:
```bash
# 1. Check capacity — do the subnets have available IPs?
aws ec2 describe-subnets --subnet-ids subnet-abc subnet-def \
  --query 'Subnets[*].{id:SubnetId,az:AvailabilityZone,available:AvailableIpAddressCount}'

# 2. For EC2 launch type, check remaining CPU/memory on container instances
aws ecs describe-container-instances --cluster production \
  --container-instances $(aws ecs list-container-instances --cluster production --query 'containerInstanceArns' --output text) \
  --query 'containerInstances[*].{id:ec2InstanceId,cpu:remainingResources[?name==`CPU`].integerValue|[0],mem:remainingResources[?name==`MEMORY`].integerValue|[0],status:status}'

# 3. Check security group allows required traffic
aws ec2 describe-security-groups --group-ids sg-123 \
  --query 'SecurityGroups[0].IpPermissions'

# 4. Check placement constraints (if any)
aws ecs describe-task-definition --task-definition api-service:42 \
  --query 'taskDefinition.placementConstraints'

# 5. For Fargate, verify the task CPU/memory is a valid combination
aws ecs describe-task-definition --task-definition api-service:42 \
  --query 'taskDefinition.{cpu:cpu,memory:memory}'
```
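For step 5, the valid pairs can be checked locally before you push a task definition. A sketch covering the classic 0.25–4 vCPU Fargate sizes, in the units the task definition uses (CPU units and MiB); newer 8 and 16 vCPU sizes are not handled here:

```bash
# Returns 0 for a valid Fargate cpu/memory combination, 1 otherwise.
# Covers 256 (.25 vCPU) through 4096 (4 vCPU) only.
valid_fargate_combo() {
  local cpu="$1" mem="$2"
  case "$cpu" in
    256)  case "$mem" in 512|1024|2048) return 0 ;; esac ;;
    512)  [ "$mem" -ge 1024 ] && [ "$mem" -le 4096 ]  && [ $(( mem % 1024 )) -eq 0 ] && return 0 ;;
    1024) [ "$mem" -ge 2048 ] && [ "$mem" -le 8192 ]  && [ $(( mem % 1024 )) -eq 0 ] && return 0 ;;
    2048) [ "$mem" -ge 4096 ] && [ "$mem" -le 16384 ] && [ $(( mem % 1024 )) -eq 0 ] && return 0 ;;
    4096) [ "$mem" -ge 8192 ] && [ "$mem" -le 30720 ] && [ $(( mem % 1024 )) -eq 0 ] && return 0 ;;
  esac
  return 1
}
```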
Common placement failure causes:
- No IPs in subnet: awsvpc mode needs one IP per task. Small /24 subnets run out fast with many tasks.
- ENI limit on EC2: Each awsvpc task consumes its own ENI. A c5.large supports only 3 ENIs, one of which is the instance's primary interface, so it fits just 2 awsvpc tasks unless ENI trunking is enabled.
- Security group misconfiguration: Task cannot reach the load balancer health check endpoint.
- Capacity provider exhausted: Fargate capacity temporarily unavailable in the AZ.
> **Default trap:** Each Fargate task in `awsvpc` mode gets its own ENI and private IP. A `/24` subnet has only 251 usable IPs. If you run 200 tasks across two subnets, a rolling deployment that temporarily doubles task count can silently exhaust IPs. Always use `/20` or larger subnets for ECS workloads, and monitor `AvailableIpAddressCount` on your subnets.
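The subnet math is easy to script. A sketch, relying on the fact that AWS reserves 5 addresses in every subnet (network, router, DNS, future use, broadcast):

```bash
# Usable IPs in a subnet of the given prefix length:
# total addresses minus the 5 AWS-reserved ones.
usable_ips() {
  echo $(( (1 << (32 - $1)) - 5 ))
}

# usable_ips 24 -> 251, usable_ips 20 -> 4091
```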
## Service Not Reaching Steady State
The service keeps cycling: starting tasks, failing health checks, killing them, starting new ones.
```bash
# Check service events for the cycling pattern
aws ecs describe-services --cluster production --services api-service \
  --query 'services[0].events[:20].[createdAt,message]' --output table
# Look for patterns like:
#   "service api-service has reached a steady state"  (good)
#   "service api-service was unable to place a task"  (placement issue)
#   "service api-service has started 1 tasks: task abc123" then
#   "service api-service deregistered 1 targets in target-group api-tg"  (health check failure)

# Check the stopped task reason
aws ecs describe-tasks --cluster production \
  --tasks arn:aws:ecs:us-east-1:123456789:task/production/abc123 \
  --query 'tasks[0].{reason:stoppedReason,exitCode:containers[0].exitCode,status:lastStatus}'

# Check CloudWatch logs for the container
aws logs get-log-events \
  --log-group-name /ecs/api-service \
  --log-stream-name "api/api/abc123" \
  --limit 50
```
Most common root causes:

1. Container crashes on startup — check exit code and logs
2. Health check fails — container boots too slowly, health check grace period too short
3. Secrets Manager / SSM pull fails — execution role lacks permissions
4. Image pull fails — ECR permissions, image tag does not exist, or VPC endpoint missing
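The checklist above can be folded into a rough triage helper. The matched substrings are real ECS `stoppedReason` values; the helper itself (`triage_stopped_task`) and the advice strings are illustrative:

```bash
# Rough triage from stoppedReason + exit code.
triage_stopped_task() {
  local reason="$1" exit_code="$2"
  case "$reason" in
    *OutOfMemory*)                 echo "oom: raise the memory limit or fix the leak" ;;
    *CannotPullContainerError*)    echo "image-pull: ECR permissions, tag, or VPC endpoint" ;;
    *ResourceInitializationError*) echo "init: secrets/SSM pull or ENI setup failed" ;;
    *"Essential container in task exited"*)
                                   echo "app-crash: exit code $exit_code, read the logs" ;;
    *)                             echo "unclassified" ;;
  esac
}
```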
## ECS Exec to Shell into Fargate Tasks

```bash
# Verify ECS Exec is enabled on the service
aws ecs describe-services --cluster production --services api-service \
  --query 'services[0].enableExecuteCommand'

# Check if the specific task has managed agents running
aws ecs describe-tasks --cluster production \
  --tasks arn:aws:ecs:us-east-1:123456789:task/production/abc123 \
  --query 'tasks[0].containers[0].managedAgents'

# Open a shell
aws ecs execute-command \
  --cluster production \
  --task arn:aws:ecs:us-east-1:123456789:task/production/abc123 \
  --container api \
  --interactive \
  --command "/bin/sh"

# If ECS Exec fails, check:
# 1. Task role has ssmmessages:* permissions
# 2. VPC has endpoints for ssmmessages (or NAT gateway for internet)
# 3. Service was updated with --enable-execute-command
# 4. Task was launched AFTER ECS Exec was enabled (old tasks won't have it)
```
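Point 4 is worth automating: before attempting a shell, confirm the `ExecuteCommandAgent` is actually RUNNING in the `managedAgents` output from the describe-tasks query above. A hypothetical preflight that greps the JSON:

```bash
# Succeeds when the managedAgents JSON shows a RUNNING ExecuteCommandAgent.
# Crude substring check; a JSON parser would be more robust.
exec_ready() {
  printf '%s' "$1" | grep -q '"name": "ExecuteCommandAgent"' &&
    printf '%s' "$1" | grep -q '"lastStatus": "RUNNING"'
}
```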
## Investigating OOMKilled Containers

```bash
# Check if the container was OOM killed
aws ecs describe-tasks --cluster production \
  --tasks arn:aws:ecs:us-east-1:123456789:task/production/abc123 \
  --query 'tasks[0].containers[0].{reason:reason,exitCode:exitCode}'
# Output when OOM killed:
# { "reason": "OutOfMemoryError: Container killed due to memory usage", "exitCode": 137 }

# Check what memory was allocated vs what the task definition allows
aws ecs describe-task-definition --task-definition api-service:42 \
  --query 'taskDefinition.containerDefinitions[0].{memoryHard:memory,memorySoft:memoryReservation}'

# Look at CloudWatch Container Insights for memory trends
aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name MemoryUtilized \
  --dimensions Name=ClusterName,Value=production Name=ServiceName,Value=api-service \
  --start-time $(date -u -d '2 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 300 \
  --statistics Average Maximum
```
Fix: Increase the task definition memory limit, or fix the memory leak in the application. In Fargate, you can only pick from fixed vCPU/memory combos — you may need to bump to the next tier.
> **Debug clue:** Exit code 137 means the container was killed with SIGKILL, which usually points at the OOM killer. But in ECS you also see exit code 137 when tasks are stopped during deployments and do not exit on SIGTERM. Check `stoppedReason` to distinguish: "OutOfMemoryError" means OOM, "Essential container in task exited" means another container crashed first, and "Scaling activity" means a scale-in event.
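The three `stoppedReason` patterns can be codified for scripts; substring matches, since real reasons append detail after these prefixes:

```bash
# Classify a stoppedReason for a task whose container exited with 137.
classify_exit_137() {
  case "$1" in
    *OutOfMemory*)                          echo "oom-kill" ;;
    *"Essential container in task exited"*) echo "sibling-container-crash" ;;
    *"Scaling activity"*)                   echo "scale-in" ;;
    *)                                      echo "other" ;;
  esac
}
```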
## Task Definition Rollback

```bash
# List recent revisions
aws ecs list-task-definitions --family-prefix api-service --sort DESC --max-items 5

# Update the service to use a previous revision
aws ecs update-service \
  --cluster production \
  --service api-service \
  --task-definition api-service:41

# Wait for the deployment to stabilize
aws ecs wait services-stable --cluster production --services api-service

# Verify
aws ecs describe-services --cluster production --services api-service \
  --query 'services[0].{taskDef:taskDefinition,desired:desiredCount,running:runningCount}'
```
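If you script rollbacks, deriving the previous revision from the current `family:revision` string is simple string surgery. A sketch that assumes plain `family:revision` input (not a full ARN) and contiguous revision numbers:

```bash
# api-service:42 -> api-service:41
previous_revision() {
  local family="${1%:*}" rev="${1##*:}"
  echo "${family}:$(( rev - 1 ))"
}
```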
## Logging with FireLens and CloudWatch

```json
{
  "logConfiguration": {
    "logDriver": "awsfirelens",
    "options": {
      "Name": "datadog",
      "dd_service": "api-service",
      "dd_source": "ecs",
      "dd_tags": "env:production,team:platform"
    },
    "secretOptions": [
      {
        "name": "apikey",
        "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789:secret:datadog-api-key"
      }
    ]
  }
}
```
FireLens uses a Fluent Bit or Fluentd sidecar to route container logs. It replaces the awslogs driver when you need to send logs to multiple destinations or apply filters.
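A minimal two-container task definition fragment for this setup might look like the following sketch; the image tag and ECR URL are illustrative, and the `log_router` name and `/ecs/api-service-firelens` group are chosen to match the log stream layout used in this section:

```json
{
  "containerDefinitions": [
    {
      "name": "log_router",
      "image": "public.ecr.aws/aws-observability/aws-for-fluent-bit:stable",
      "essential": true,
      "firelensConfiguration": { "type": "fluentbit" },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/api-service-firelens",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "firelens"
        }
      }
    },
    {
      "name": "api",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/api-service:latest",
      "essential": true,
      "logConfiguration": {
        "logDriver": "awsfirelens",
        "options": { "Name": "datadog" }
      }
    }
  ]
}
```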
```bash
# Verify the FireLens sidecar is running
aws ecs describe-tasks --cluster production \
  --tasks arn:aws:ecs:us-east-1:123456789:task/production/abc123 \
  --query 'tasks[0].containers[*].{name:name,status:lastStatus}'

# If the FireLens sidecar crashes, the app container loses log routing.
# Check the sidecar logs in CloudWatch (the FireLens container itself still uses awslogs):
aws logs get-log-events \
  --log-group-name /ecs/api-service-firelens \
  --log-stream-name "firelens/log_router/abc123" \
  --limit 30
```
## Service Discovery Debugging

```bash
# List Cloud Map services
aws servicediscovery list-services

# Check registered instances for a service
aws servicediscovery list-instances --service-id srv-abc123

# DNS lookup to verify resolution
dig +short worker.production.local

# If DNS returns stale IPs, check the deregistration delay.
# ECS deregisters tasks from Cloud Map when they stop, but DNS TTL may cache old records.
# Default TTL is 60 seconds — during a deployment, clients may briefly resolve to stopping tasks.
```
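A quick way to spot staleness is to diff what Cloud Map says is registered against what DNS actually returned. `stale_ips` is a hypothetical helper taking two space-separated IP lists (e.g. from `list-instances` and `dig +short`):

```bash
# Print each resolved IP that is no longer registered in Cloud Map.
stale_ips() {
  local registered="$1" ip
  for ip in $2; do                 # word-split the resolved list
    case " $registered " in
      *" $ip "*) ;;                # still registered: fine
      *) echo "$ip" ;;             # resolved but deregistered: stale
    esac
  done
}
```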
## Deployment Circuit Breaker

```bash
# Enable circuit breaker on an existing service
aws ecs update-service \
  --cluster production \
  --service api-service \
  --deployment-configuration '{
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    },
    "minimumHealthyPercent": 100,
    "maximumPercent": 200
  }'

# The circuit breaker triggers when the number of failed tasks exceeds a threshold
# (based on desired count). When triggered, it rolls back to the previous task definition.
# Check deployment events to see if rollback occurred:
aws ecs describe-services --cluster production --services api-service \
  --query 'services[0].deployments[*].{status:status,taskDef:taskDefinition,rolloutState:rolloutState}'
# Output during a rollback:
# [
#   { "status": "PRIMARY", "taskDef": "api-service:41", "rolloutState": "COMPLETED" },
#   { "status": "ACTIVE",  "taskDef": "api-service:42", "rolloutState": "FAILED" }
# ]
```
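Per AWS documentation, the failure threshold is roughly half the desired count, clamped to a minimum of 10 and a maximum of 200; treat the exact rounding as an assumption and verify against current docs. As arithmetic:

```bash
# Approximate circuit breaker failure threshold for a given desired count:
# half the desired count, clamped to [10, 200].
cb_threshold() {
  local t=$(( $1 / 2 ))
  [ "$t" -lt 10 ] && t=10
  [ "$t" -gt 200 ] && t=200
  echo "$t"
}

# A 4-task service still needs 10 failed tasks before tripping,
# so small services can cycle for a while before rollback kicks in.
```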
## Capacity Provider Strategies

```bash
# Check current capacity provider strategy
aws ecs describe-services --cluster production --services api-service \
  --query 'services[0].capacityProviderStrategy'

# Update strategy: 2 base tasks on FARGATE, remainder 3:1 on FARGATE_SPOT
aws ecs update-service \
  --cluster production \
  --service api-service \
  --capacity-provider-strategy \
    "capacityProvider=FARGATE,weight=1,base=2" \
    "capacityProvider=FARGATE_SPOT,weight=3"

# Monitor Fargate Spot interruptions in CloudTrail
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=StopTask \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \
  --query 'Events[?contains(CloudTrailEvent, `SpotInterruption`)].{time:EventTime,event:CloudTrailEvent}'
```
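How base and weight interact trips people up: base tasks are placed first on that provider, then the remainder is split by weight ratio. A sketch of the resulting split (the scheduler's exact rounding is an implementation detail; this floors the Spot share):

```bash
# split_tasks desired base weight_ondemand weight_spot
# Prints "<on-demand> <spot>" for the strategy above.
split_tasks() {
  local desired="$1" base="$2" w_od="$3" w_spot="$4"
  local rest=$(( desired - base ))
  local spot=$(( rest * w_spot / (w_od + w_spot) ))
  echo "$(( base + rest - spot )) $spot"
}

# With desired=10, base=2, weights 1:3 -> 4 on FARGATE, 6 on FARGATE_SPOT.
```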
## Container Insights Metrics

```bash
# Enable Container Insights on a cluster
aws ecs update-cluster-settings \
  --cluster production \
  --settings name=containerInsights,value=enabled
```
> **Gotcha:** Container Insights costs money. Each ECS task generates custom CloudWatch metrics, and at scale (hundreds of tasks) the CloudWatch bill can be surprisingly high. Enable it on production clusters where the visibility matters and disable it on dev/staging unless actively debugging.
Key metrics to watch:

- CpuUtilized / CpuReserved (per service)
- MemoryUtilized / MemoryReserved (per service)
- NetworkRxBytes / NetworkTxBytes
- RunningTaskCount vs DesiredTaskCount
- StorageReadBytes / StorageWriteBytes (Fargate ephemeral storage)
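Reading these as pairs is the point: utilization only means something relative to the reservation. A trivial helper for alert scripts (inputs in MiB, integer math floors the ratio):

```bash
# mem_over_threshold <utilized_mib> <reserved_mib> <percent>
# Succeeds when utilization is at or above the given percentage of the reservation.
mem_over_threshold() {
  [ $(( 100 * $1 / $2 )) -ge "$3" ]
}
```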
```bash
# Query CPU utilization trend
aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name CpuUtilized \
  --dimensions Name=ClusterName,Value=production Name=ServiceName,Value=api-service \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 60 \
  --statistics Average Maximum
```