AWS ECS Footguns¶
Mistakes that cause deployment failures, outages, or silent misconfigurations in ECS.
1. Confusing task role with execution role¶
You put S3 permissions on the execution role and wonder why your container gets AccessDenied. The execution role is used by the ECS agent — it pulls images from ECR and writes logs to CloudWatch. The task role is assumed by your application at runtime — it accesses S3, DynamoDB, SQS, Secrets Manager, etc.
Fix: Always configure two separate IAM roles. The execution role gets ecr:GetAuthorizationToken, ecr:BatchGetImage, logs:CreateLogStream, logs:PutLogEvents. The task role gets whatever your application needs. Never combine them into one role.
Remember: Execution role = ECS agent's identity (pulls images, writes logs). Task role = your app's identity (calls AWS APIs). Think of it like: execution role is the courier who delivers the package, task role is the person who opens it. The courier doesn't need to know what's inside.
2. awsvpc mode exhausting subnet IPs and ENIs¶
You deploy 50 Fargate tasks into a /24 subnet (251 usable IPs; AWS reserves 5 addresses in every subnet). Each task consumes one IP. Your ALB, NAT gateway, and RDS instances also consume IPs in the same subnet. At 200 tasks you run out and placement fails with no obvious error: the service event just says "unable to place a task."
On EC2 launch type, each awsvpc task consumes an ENI. A c5.large supports only 3 ENIs (including the primary), so without ENI trunking just 2 awsvpc tasks fit per instance. You hit the ENI limit long before touching CPU or memory.
Fix: Use /20 or larger subnets for Fargate workloads. For EC2, enable ENI trunking (the awsvpcTrunking account setting) on supported instance types, or consider bridge networking mode if you run many small tasks per instance. Monitor AvailableIpAddressCount on your subnets.
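The capacity math above can be sketched in a few lines. This is illustrative arithmetic, not an AWS API: the reserved-IP count and the example ENI limit are assumptions to verify against current AWS documentation for your instance types.

```python
# Illustrative awsvpc capacity math (assumptions noted in comments).

def usable_ips(prefix_length: int) -> int:
    """AWS reserves 5 addresses in every subnet (network, VPC router,
    DNS, future use, broadcast), not the usual 2."""
    return 2 ** (32 - prefix_length) - 5

def max_awsvpc_tasks(instance_eni_limit: int) -> int:
    """On EC2 without ENI trunking, one ENI is the instance's primary;
    each awsvpc task consumes one of the remaining ENIs."""
    return instance_eni_limit - 1

print(usable_ips(24))       # a /24 subnet
print(usable_ips(20))       # a /20 subnet
print(max_awsvpc_tasks(3))  # an instance with a 3-ENI limit, no trunking
```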
3. Health check grace period too short¶
Your container takes 45 seconds to start. The load balancer health check runs every 10 seconds with a 2-unhealthy threshold. The target gets deregistered at 20 seconds. ECS replaces the task. The new task also takes 45 seconds. The service enters an infinite restart loop.
Fix: Set healthCheckGracePeriodSeconds on the service to at least 1.5x your container's startup time. If the container takes 45 seconds, set the grace period to 90 seconds. This pauses health check evaluation during startup.
aws ecs update-service \
--cluster production \
--service api-service \
--health-check-grace-period-seconds 90
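The timing in this section reduces to two small formulas, sketched below. The multiplier is the rule of thumb from the fix above; the example in the text rounds up further to 2x for extra margin.

```python
import math

def time_to_deregister(interval_s: int, unhealthy_threshold: int) -> int:
    """A target is marked unhealthy after roughly
    interval * threshold seconds of failing checks."""
    return interval_s * unhealthy_threshold

def grace_period(startup_s: int, multiplier: float = 1.5) -> int:
    """Recommended healthCheckGracePeriodSeconds for a given startup time."""
    return math.ceil(startup_s * multiplier)

print(time_to_deregister(10, 2))   # 20 s: less than a 45 s startup, so the loop begins
print(grace_period(45))            # minimum recommendation at 1.5x
print(grace_period(45, 2.0))       # the more generous 2x used in the text
```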
4. Fargate task size limits catch you mid-migration¶
You migrate from EC2 to Fargate. Your task definition requests 32 vCPU and 128 GB memory. Fargate caps at 16 vCPU and 120 GB. The task fails to register. Or you request 1 vCPU with 16 GB memory — Fargate only allows 2-8 GB for 1 vCPU tasks. The valid combinations are fixed and not intuitive.
Fix: Check the Fargate vCPU/memory matrix before migrating. If your workload exceeds Fargate limits, stay on EC2 launch type or restructure the workload into smaller tasks.
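A pre-migration check can encode the matrix directly. The table below reflects the Fargate combinations as commonly documented at the time of writing (CPU in units, memory in MiB); verify it against the current AWS documentation before relying on it.

```python
# Fargate vCPU/memory matrix, up to 16 vCPU (assumed values; verify
# against current AWS docs before use).
FARGATE_COMBOS = {
    256:   [512, 1024, 2048],                          # 0.25 vCPU
    512:   list(range(1024, 4096 + 1, 1024)),          # 0.5 vCPU
    1024:  list(range(2048, 8192 + 1, 1024)),          # 1 vCPU
    2048:  list(range(4096, 16384 + 1, 1024)),         # 2 vCPU
    4096:  list(range(8192, 30720 + 1, 1024)),         # 4 vCPU
    8192:  list(range(16384, 61440 + 1, 4096)),        # 8 vCPU
    16384: list(range(32768, 122880 + 1, 8192)),       # 16 vCPU
}

def valid_fargate_size(cpu_units: int, memory_mib: int) -> bool:
    """True only if the exact CPU/memory pair is a registrable combination."""
    return memory_mib in FARGATE_COMBOS.get(cpu_units, [])

print(valid_fargate_size(1024, 16384))    # 1 vCPU with 16 GB: invalid
print(valid_fargate_size(1024, 4096))     # 1 vCPU with 4 GB: valid
print(valid_fargate_size(32768, 131072))  # 32 vCPU: not on Fargate at all
```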
5. No SSH into Fargate — and ECS Exec requires setup¶
A container is misbehaving. You try to SSH in. There is no SSH on Fargate — there is no host to SSH into. You try ECS Exec but get "Unable to start command" because the task role lacks ssmmessages:* permissions, or the VPC has no SSM endpoints, or the task was launched before you enabled ECS Exec on the service.
Fix: Pre-configure ECS Exec before you need it:
1. Add ssmmessages:CreateControlChannel, ssmmessages:CreateDataChannel, ssmmessages:OpenControlChannel, ssmmessages:OpenDataChannel to the task role.
2. Ensure the VPC has NAT gateway access or VPC endpoints for ssmmessages.
3. Enable execute command on the service: aws ecs update-service --enable-execute-command.
4. Force a new deployment so tasks pick up the SSM agent sidecar.
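Step 1 above is the one most often missed, and it can be checked mechanically. This sketch compares a policy's action list against the four required ssmmessages actions; the sample policy at the bottom is hypothetical, not pulled from a real account.

```python
# The four actions ECS Exec needs on the task role.
REQUIRED_EXEC_ACTIONS = {
    "ssmmessages:CreateControlChannel",
    "ssmmessages:CreateDataChannel",
    "ssmmessages:OpenControlChannel",
    "ssmmessages:OpenDataChannel",
}

def missing_exec_actions(policy_actions: set) -> set:
    """Return the ssmmessages actions the policy does not grant.
    A wildcard 'ssmmessages:*' covers all four."""
    if "ssmmessages:*" in policy_actions:
        return set()
    return REQUIRED_EXEC_ACTIONS - policy_actions

# Hypothetical task role: app permissions plus one of the four actions.
task_role_actions = {"s3:GetObject", "ssmmessages:CreateControlChannel"}
print(sorted(missing_exec_actions(task_role_actions)))  # three still missing
```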
6. Minimum healthy percent set too low causes downtime¶
You set minimumHealthyPercent: 0 to allow full replacement during deploys. ECS stops all running tasks before starting new ones. For the duration of the new tasks booting, you have zero capacity. If the new task definition is broken, you have zero capacity permanently until you roll back.
Fix: Keep minimumHealthyPercent: 100 and maximumPercent: 200 for zero-downtime deployments. This starts new tasks before stopping old ones. You temporarily run double capacity, but you never go below your desired count. For cost-constrained environments, use minimumHealthyPercent: 50 — you lose half capacity briefly, not all of it.
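The effect of these two parameters is just arithmetic on the desired count. A sketch of the bounds ECS enforces during a rolling deployment:

```python
import math

def deploy_bounds(desired: int, min_pct: int, max_pct: int) -> tuple:
    """Return (floor, ceiling) on running task count during a rolling
    deployment: floor from minimumHealthyPercent, ceiling from maximumPercent."""
    floor_tasks = math.ceil(desired * min_pct / 100)
    ceil_tasks = math.floor(desired * max_pct / 100)
    return floor_tasks, ceil_tasks

print(deploy_bounds(10, 0, 100))    # (0, 10): can drop to zero capacity
print(deploy_bounds(10, 100, 200))  # (10, 20): never below desired, brief double run
print(deploy_bounds(10, 50, 100))   # (5, 10): half capacity, no surge cost
```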
7. CloudWatch Logs group does not exist¶
Your task definition references /ecs/api-service as the log group. The log group does not exist. The task starts, runs, and you see no logs. The container appears healthy but you are flying blind. ECS does not fail the task when log delivery fails — it silently drops logs.
Fix: Create the log group before deploying, or add awslogs-create-group: "true" to the log configuration options (requires the execution role to have logs:CreateLogGroup). Verify log delivery after every new service deployment.
{
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/api-service",
      "awslogs-region": "us-east-1",
      "awslogs-stream-prefix": "api",
      "awslogs-create-group": "true"
    }
  }
}
8. Service auto scaling vs cluster auto scaling confusion¶
You set up service auto scaling (target tracking on CPU). Traffic spikes. The service wants 20 tasks. But the EC2 cluster only has capacity for 12. Tasks pile up in PENDING state. The service auto scaler keeps trying but there is no compute to place them on. You assumed "auto scaling" meant the cluster would grow — it does not.
Fix: Service auto scaling adjusts task count. Cluster auto scaling (via capacity providers and EC2 Auto Scaling Groups) adjusts compute. You need both. Configure a capacity provider with managed scaling enabled so the cluster grows when tasks cannot be placed:
aws ecs create-capacity-provider \
--name ec2-capacity \
--auto-scaling-group-provider "autoScalingGroupArn=arn:...,managedScaling={status=ENABLED,targetCapacity=80}"
On Fargate, cluster scaling is automatic — this problem only applies to EC2 launch type.
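Managed scaling works by target-tracking the CapacityProviderReservation metric, roughly 100 x (instances needed for all running and pending tasks) / (instances running). The sketch below is a simplified model of that metric, including the assumed special case of an empty Auto Scaling Group; treat the edge-case values as assumptions to confirm in the AWS documentation.

```python
def reservation(instances_needed: int, instances_running: int) -> float:
    """Simplified model of the CapacityProviderReservation metric.
    Values above targetCapacity (e.g. 80) trigger an ASG scale-out."""
    if instances_running == 0:
        # Assumed edge-case values when the ASG is empty.
        return 200.0 if instances_needed > 0 else 100.0
    return 100.0 * instances_needed / instances_running

# 20 tasks needing 20 instances on a 12-instance cluster:
print(reservation(20, 12))  # well above a targetCapacity of 80, so scale out
print(reservation(8, 10))   # at 80: cluster sized to target
```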
9. ECS agent version drift on EC2¶
You launched EC2 instances 8 months ago with an ECS-optimized AMI. The ECS agent on those instances is version 1.68. The current version is 1.82. New features (ECS Exec, runtime ID, container dependency conditions) do not work on the old agents. Task definitions that use new features fail on old instances but succeed on new ones.
Fix: Enable auto-update for the ECS agent: set ECS_AGENT_UPDATE_ENABLED=true in /etc/ecs/ecs.config. Or use the SSM document AmazonECS-UpdateContainerAgent. Or rotate instances regularly by updating the launch template AMI and cycling the Auto Scaling Group.
10. Fargate platform version pinning¶
You deploy without specifying a platform version. ECS uses the LATEST platform version, which AWS updates periodically. One day a platform version change introduces a behavior difference — networking initialization takes longer, or a kernel change breaks your workload. You cannot reproduce because the old platform version was already deprecated.
Fix: Pin the platform version in your service definition:
aws ecs create-service \
--cluster production \
--service api-service \
--platform-version 1.4.0 \
...
Test new platform versions explicitly before adopting. Track LATEST in dev, pin specific versions in production.
11. Secrets fail to resolve at task start¶
Your task definition references a Secrets Manager secret. The execution role has secretsmanager:GetSecretValue but not for the specific secret ARN. Or the secret is in a different region. Or the VPC has no endpoint for Secrets Manager and no NAT gateway. The task fails to start with "ResourceInitializationError: unable to pull secrets."
Fix: Verify the execution role policy grants access to the exact secret ARN (not a wildcard). Ensure the secret exists in the same region as the ECS cluster. Add a VPC endpoint for secretsmanager or ensure NAT gateway access. Test with aws secretsmanager get-secret-value --secret-id <arn> using the execution role credentials.
Debug clue: The error "ResourceInitializationError: unable to pull secrets" is maddeningly vague. Check in this order: (1) the execution role has secretsmanager:GetSecretValue for the exact ARN, (2) if the secret uses a CMK, the execution role also needs kms:Decrypt on that key, (3) the subnet has a route to Secrets Manager (NAT gateway or VPC endpoint). Missing KMS permissions are the most commonly overlooked cause.
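The triage order above can be encoded as a small decision function. The booleans stand in for the results of real IAM and VPC inspection; this is an illustrative checklist, not an AWS API call.

```python
def diagnose_secret_failure(has_get_secret: bool,
                            uses_cmk: bool,
                            has_kms_decrypt: bool,
                            has_route_to_sm: bool) -> str:
    """Walk the three checks in the order given in the debug clue."""
    if not has_get_secret:
        return "execution role lacks secretsmanager:GetSecretValue on the secret ARN"
    if uses_cmk and not has_kms_decrypt:
        return "execution role lacks kms:Decrypt on the secret's CMK"
    if not has_route_to_sm:
        return "no NAT gateway or secretsmanager VPC endpoint on the subnet"
    return "checks pass; look elsewhere (e.g. wrong region)"

# The most commonly overlooked case: CMK-encrypted secret, no kms:Decrypt.
print(diagnose_secret_failure(True, True, False, True))
```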