AWS ECS - Primer

Why This Matters

Amazon Elastic Container Service (ECS) is how most AWS-native shops run containers in production. It sits between "just run a Docker container on an EC2 instance" and "operate a full Kubernetes cluster." ECS handles scheduling, placement, service discovery, rolling deployments, and auto scaling without requiring you to manage a control plane. When ECS works, containers start and stop invisibly. When it does not — tasks fail to place, services cannot reach steady state, or deployments get stuck — you need to understand the machinery underneath to fix it at 3 AM.

Core Architecture

Clusters, Services, and Tasks

ECS Cluster
├── Service: api-service (desired: 4, running: 4)
│   ├── Task: arn:...:task/abc123  (RUNNING, 10.0.1.15)
│   ├── Task: arn:...:task/def456  (RUNNING, 10.0.1.22)
│   ├── Task: arn:...:task/ghi789  (RUNNING, 10.0.2.11)
│   └── Task: arn:...:task/jkl012  (RUNNING, 10.0.2.34)
├── Service: worker-service (desired: 2, running: 2)
│   ├── Task: arn:...:task/mno345  (RUNNING, 10.0.1.40)
│   └── Task: arn:...:task/pqr678  (RUNNING, 10.0.2.55)
└── Standalone Task: arn:...:task/stu901  (migration job, STOPPED)

Cluster: A logical grouping of tasks and services. A cluster can use Fargate (serverless), EC2 instances, or both via capacity providers. It does not own compute directly — it manages placement onto compute.

Service: A long-running, self-healing abstraction. You declare a desired count; ECS keeps that many tasks running. If a task dies, the service scheduler launches a replacement. Services integrate with load balancers and service discovery.

Task: A running instantiation of a task definition. A task contains one or more containers. Tasks are ephemeral — they run, they stop, they get replaced. You never SSH into a task; you replace it.

# List clusters
aws ecs list-clusters

# List services in a cluster
aws ecs list-services --cluster production

# Describe a service (shows desired count, running count, events)
aws ecs describe-services --cluster production --services api-service

# List running tasks for a service
aws ecs list-tasks --cluster production --service-name api-service --desired-status RUNNING

# Describe a specific task (IP, container status, last status)
aws ecs describe-tasks --cluster production --tasks arn:aws:ecs:us-east-1:123456789:task/production/abc123
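
When a task dies at 3 AM, the first field to read is stoppedReason in the describe-tasks output. A minimal sketch of extracting it from a saved payload; the field names follow the DescribeTasks API, but the sample payload here is fabricated for illustration:

```shell
# In real use: aws ecs describe-tasks --cluster production --tasks <arn> > task.json
# Sample payload (abridged; field names per the DescribeTasks API):
cat > task.json <<'EOF'
{
  "tasks": [
    {
      "lastStatus": "STOPPED",
      "stoppedReason": "Essential container in task exited"
    }
  ]
}
EOF
# Pull out the stop reason with coreutils only
grep '"stoppedReason"' task.json | sed 's/.*: *"\(.*\)".*/\1/'
# → Essential container in task exited
```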

Task Definitions

A task definition is a versioned blueprint for running containers. It is immutable once registered — you create new revisions, never edit existing ones.

{
  "family": "api-service",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789:role/api-service-task-role",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/api:v2.3.1",
      "portMappings": [
        {
          "containerPort": 8080,
          "protocol": "tcp"
        }
      ],
      "environment": [
        { "name": "LOG_LEVEL", "value": "info" }
      ],
      "secrets": [
        {
          "name": "DB_PASSWORD",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789:secret:prod/db-password"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/api-service",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "api"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      },
      "essential": true
    }
  ]
}

Key fields:

  • networkMode: awsvpc: Each task gets its own ENI and private IP. Required for Fargate. Recommended for EC2 launch type too.
  • executionRoleArn: The role ECS uses to pull images from ECR and write logs to CloudWatch. This is the ECS agent's role.
  • taskRoleArn: The role the container application assumes at runtime. This is your application's IAM identity — for S3 access, DynamoDB, SQS, etc.
  • secrets: Pulls values from Secrets Manager or SSM Parameter Store at task start. Never bake secrets into environment variables or images.
  • essential: true: If this container stops, the entire task is stopped. For sidecar patterns, set sidecars to essential: false.
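
A sketch of what the execution role's permissions policy needs to cover for the task definition above (ECR pull, CloudWatch logs, the Secrets Manager read); in practice, scope Resource down to the specific repository, log group, and secret ARNs:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "*"
    }
  ]
}
```
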

# Register a new task definition revision
aws ecs register-task-definition --cli-input-json file://task-def.json

# List task definition revisions
aws ecs list-task-definitions --family-prefix api-service

# Describe a specific revision
aws ecs describe-task-definition --task-definition api-service:42

Launch Types: Fargate vs EC2

Fargate (Serverless)

Fargate runs your containers without managing EC2 instances. AWS provisions compute per-task. You specify CPU and memory at the task definition level and pay per vCPU-second and GB-second.

Fargate vCPU/memory combinations (you must pick from these fixed combos):

vCPU   Memory (GB)
0.25   0.5, 1, 2
0.5    1, 2, 3, 4
1      2, 3, 4, 5, 6, 7, 8
2      4–16 (1 GB increments)
4      8–30 (1 GB increments)
8      16–60 (4 GB increments)
16     32–120 (8 GB increments)

Fargate pricing (us-east-1, on-demand, approximate):

  • vCPU: ~$0.04048/hour
  • Memory: ~$0.004445/GB/hour
  • A 1 vCPU / 2 GB task costs roughly $36/month running 24/7
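
The monthly figure is just the two hourly rates multiplied out. A quick check, assuming the rates quoted above (which change over time) and 730 hours in an average month:

```shell
# Back-of-envelope Fargate on-demand cost for a 1 vCPU / 2 GB task
vcpu=1; mem_gb=2; hours=730
awk -v v="$vcpu" -v m="$mem_gb" -v h="$hours" \
  'BEGIN { printf "$%.2f/month\n", (v * 0.04048 + m * 0.004445) * h }'
# → $36.04/month
```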

Fargate Spot: Same capacity at up to 70% discount. AWS can terminate your task with a 2-minute warning via SIGTERM. Good for fault-tolerant workloads (batch processing, queue workers). Not suitable for stateful services.

EC2 Launch Type

You manage a fleet of EC2 instances registered to the ECS cluster. The ECS agent runs on each instance. You control instance types, AMIs, and scaling.

When to use EC2 over Fargate:

  • GPU workloads (Fargate does not support GPUs)
  • Very large tasks (>16 vCPU, >120 GB memory)
  • Need for specific instance families (compute-optimized, memory-optimized)
  • Cost optimization at scale (reserved instances, savings plans)
  • Workloads requiring specific kernel parameters or host-level access

Service Types

Replica Service

Maintains a desired count of tasks spread across Availability Zones. The default and most common type.

aws ecs create-service \
  --cluster production \
  --service-name api-service \
  --task-definition api-service:42 \
  --desired-count 4 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-abc,subnet-def],securityGroups=[sg-123],assignPublicIp=DISABLED}" \
  --scheduling-strategy REPLICA

Daemon Service

Runs exactly one task per EC2 instance in the cluster. Used for log agents, monitoring agents, and node-level utilities. Only available on EC2 launch type.

aws ecs create-service \
  --cluster production \
  --service-name datadog-agent \
  --task-definition datadog-agent:5 \
  --scheduling-strategy DAEMON

Networking Modes

awsvpc (Fargate and EC2)

Each task gets its own elastic network interface (ENI) with a private IP from the VPC subnet. Required for Fargate. Behaves like a first-class VPC citizen — security groups apply at the task level.

Capacity concern: Each task's ENI consumes a private IP and an ENI slot on the EC2 instance. A c5.large supports only 3 ENIs by default; enabling ENI trunking (the awsVpcTrunking account setting) raises that to 10 task ENIs on supported instance types. If you run many small tasks on EC2 with awsvpc mode, you can exhaust ENI capacity long before CPU or memory.
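
Both limits are visible from the CLI. The opt-in below assumes the awsVpcTrunking account setting; trunking applies only to supported instance types and only to container instances launched after it is enabled:

```shell
# How many ENIs does this instance type support natively?
aws ec2 describe-instance-types \
  --instance-types c5.large \
  --query 'InstanceTypes[0].NetworkInfo.MaximumNetworkInterfaces'

# Opt the account into ENI trunking so tasks use branch ENIs
aws ecs put-account-setting-default \
  --name awsVpcTrunking \
  --value enabled
```
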

bridge (EC2 only)

Docker bridge networking. Containers share the host's ENI via port mapping. Multiple containers can run the same containerPort because Docker maps them to different host ports.

host (EC2 only)

Container binds directly to the host's network namespace. No port mapping. Only one container per host can use a given port. Maximum performance, minimum isolation.

Load Balancing

ECS services integrate natively with Application Load Balancers (ALB) and Network Load Balancers (NLB).

aws ecs create-service \
  --cluster production \
  --service-name api-service \
  --task-definition api-service:42 \
  --desired-count 4 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-abc,subnet-def],securityGroups=[sg-123],assignPublicIp=DISABLED}" \
  --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:us-east-1:123456789:targetgroup/api-tg/abc123,containerName=api,containerPort=8080"

Dynamic port mapping (bridge mode): ECS registers container instances with randomly assigned host ports into the ALB target group. The ALB routes to whichever host:port is healthy. With awsvpc mode, each task has its own IP, so the target group registers IP:containerPort directly.

Health check grace period: When a new task starts, the load balancer begins health checking it immediately. If the container takes 30 seconds to boot but the health check interval is 10 seconds with an unhealthy threshold of 2, the target is marked unhealthy after roughly 20 seconds and ECS kills the task before it ever finishes starting, so the service loops forever replacing tasks. Set healthCheckGracePeriodSeconds on the service to cover your container's startup time.
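
The grace period is a service-level setting, not part of the task definition. Assuming a container that needs up to two minutes to warm up:

```shell
# Ignore ALB health check failures for the first 120 seconds of task life
aws ecs update-service \
  --cluster production \
  --service api-service \
  --health-check-grace-period-seconds 120
```
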

Service Discovery (AWS Cloud Map)

ECS can register tasks with AWS Cloud Map to provide DNS-based service discovery.

# Create a Cloud Map namespace
aws servicediscovery create-private-dns-namespace \
  --name production.local \
  --vpc vpc-abc123

# Create an ECS service with service discovery
aws ecs create-service \
  --cluster production \
  --service-name worker \
  --task-definition worker:10 \
  --desired-count 3 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-abc],securityGroups=[sg-456],assignPublicIp=DISABLED}" \
  --service-registries "registryArn=arn:aws:servicediscovery:us-east-1:123456789:service/srv-abc123"

Other services resolve worker.production.local via DNS and get back the private IPs of healthy tasks. Cloud Map supports both DNS (A/SRV records) and API-based discovery.
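
To spot-check registration without a client inside the VPC, Cloud Map's API-based discovery can be queried directly (namespace and service names as created above):

```shell
# List healthy task instances registered under worker.production.local
aws servicediscovery discover-instances \
  --namespace-name production.local \
  --service-name worker \
  --health-status HEALTHY
```
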

Auto Scaling

Service Auto Scaling (Application Auto Scaling)

Adjusts the desired count of a service based on metrics.

Target tracking (recommended for most cases):

# Register the scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/production/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 20

# Create target tracking policy (target 70% CPU)
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300
  }'

Step scaling: More granular control. Define step adjustments based on CloudWatch alarm thresholds. Useful when you need aggressive scale-out but conservative scale-in.
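
As a sketch of that asymmetric pattern, the hypothetical policy below adds two tasks when the alarm metric breaches by 0–15 and four beyond that; the CloudWatch alarm that invokes the policy is created separately:

```shell
aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --resource-id service/production/api-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-name cpu-step-scale-out \
  --policy-type StepScaling \
  --step-scaling-policy-configuration '{
    "AdjustmentType": "ChangeInCapacity",
    "StepAdjustments": [
      { "MetricIntervalLowerBound": 0,  "MetricIntervalUpperBound": 15, "ScalingAdjustment": 2 },
      { "MetricIntervalLowerBound": 15, "ScalingAdjustment": 4 }
    ],
    "Cooldown": 60
  }'
```
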

Capacity Providers

Capacity providers abstract the compute layer. For Fargate, you use FARGATE and FARGATE_SPOT providers. For EC2, you associate an Auto Scaling Group.

aws ecs put-cluster-capacity-providers \
  --cluster production \
  --capacity-providers FARGATE FARGATE_SPOT \
  --default-capacity-provider-strategy \
    "capacityProvider=FARGATE,weight=1,base=2" \
    "capacityProvider=FARGATE_SPOT,weight=3"

This runs a base of 2 tasks on standard Fargate and spreads the rest 3:1 toward Fargate Spot.
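
To make base and weight concrete: with base=2 and weights 1:3, scaling this service to 10 tasks lands roughly 4 tasks on FARGATE and 6 on FARGATE_SPOT. A sketch of the arithmetic (ECS's actual rounding during placement can differ by a task):

```shell
# Approximate split for base=2 on FARGATE, weights FARGATE:FARGATE_SPOT = 1:3
total=10
awk -v t="$total" 'BEGIN {
  base = 2; rest = t - base             # base tasks always land on FARGATE
  on_fargate = base + int(rest * 1/4)   # weight 1 out of 1+3
  on_spot    = rest - int(rest * 1/4)   # weight 3 out of 1+3
  printf "FARGATE=%d FARGATE_SPOT=%d\n", on_fargate, on_spot
}'
# → FARGATE=4 FARGATE_SPOT=6
```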

Deployments

Rolling Update (Default)

ECS replaces tasks in batches. Controlled by minimumHealthyPercent and maximumPercent:

  • minimumHealthyPercent: 100, maximumPercent: 200 — start new tasks first, then drain old ones (zero-downtime default)
  • minimumHealthyPercent: 50, maximumPercent: 100 — stop half, start new half (saves capacity, brief reduced capacity)
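
With the zero-downtime defaults, the deployment batch size is whatever headroom maximumPercent leaves above the desired count. A quick sketch for a service with desired count 4:

```shell
# Rolling-update headroom: desired=4, minimumHealthyPercent=100, maximumPercent=200
desired=4; min_pct=100; max_pct=200
min_healthy=$(( desired * min_pct / 100 ))  # tasks that must stay RUNNING: 4
max_total=$((  desired * max_pct / 100 ))   # hard ceiling during the deploy: 8
echo "batch size: $(( max_total - desired )) new tasks start before any old one stops"
```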

Blue-Green with CodeDeploy

CodeDeploy manages traffic shifting between two target groups. The service runs two task sets (blue and green) behind the same ALB.

# appspec.yaml for CodeDeploy ECS blue-green
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: "arn:aws:ecs:us-east-1:123456789:task-definition/api-service:43"
        LoadBalancerInfo:
          ContainerName: "api"
          ContainerPort: 8080

Traffic shifting strategies include CodeDeployDefault.ECSAllAtOnce, CodeDeployDefault.ECSLinear10PercentEvery1Minutes, and CodeDeployDefault.ECSCanary10Percent5Minutes.

Deployment Circuit Breaker

Automatically rolls back a deployment if new tasks fail to stabilize:

aws ecs create-service \
  --cluster production \
  --service-name api-service \
  --task-definition api-service:42 \
  --desired-count 4 \
  --deployment-configuration '{
    "deploymentCircuitBreaker": {
      "enable": true,
      "rollback": true
    },
    "minimumHealthyPercent": 100,
    "maximumPercent": 200
  }'

ECS Exec (Debugging)

ECS Exec uses Systems Manager (SSM) Session Manager to open a shell session into a running container. It works on both Fargate and EC2; on Fargate it is the only equivalent of docker exec, since there is no host to SSH into.

# Enable ECS Exec on the service (requires SSM permissions on the task role)
aws ecs update-service \
  --cluster production \
  --service api-service \
  --enable-execute-command

# Exec into a running task
aws ecs execute-command \
  --cluster production \
  --task arn:aws:ecs:us-east-1:123456789:task/production/abc123 \
  --container api \
  --interactive \
  --command "/bin/sh"

Prerequisites: The task role needs ssmmessages:CreateControlChannel, ssmmessages:CreateDataChannel, ssmmessages:OpenControlChannel, ssmmessages:OpenDataChannel. The task must be running with ECS Exec enabled.
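
A minimal task-role policy statement covering those four actions (a sketch; the ssmmessages actions do not support resource-level scoping, hence the wildcard):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ],
      "Resource": "*"
    }
  ]
}
```
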

ECS Anywhere

Run ECS tasks on your own on-premises or edge servers. Install the ECS agent and SSM Agent on the external machines, then register them to your cluster.

# Register an external instance (generates activation code)
aws ssm create-activation \
  --iam-role ECSAnywhereRole \
  --registration-limit 10

# On the external machine
curl -o ecs-anywhere-install.sh "https://amazon-ecs-agent.s3.amazonaws.com/ecs-anywhere-install-latest.sh"
sudo bash ecs-anywhere-install.sh \
  --cluster production \
  --activation-id <id> \
  --activation-code <code> \
  --region us-east-1

Use case: running containers on hardware that must stay on-premises (compliance, latency, GPU servers) while managing them through the same ECS API.

ECS vs EKS Decision

Factor                     ECS                              EKS
Control plane management   Fully managed, free              Managed, $0.10/hr (~$73/mo)
Learning curve             Low (AWS-native concepts)        High (Kubernetes ecosystem)
Ecosystem                  AWS-specific tooling             Massive open-source ecosystem
Portability                AWS lock-in                      Kubernetes runs anywhere
Multi-cloud                No                               Yes (with effort)
Advanced scheduling        Basic (placement constraints)    Rich (affinity, taints, topology)
Service mesh               App Mesh (limited adoption)      Istio, Linkerd, native sidecars
GPU support                EC2 launch type only             Full support

Choose ECS when: Your team is AWS-native, you want simplicity, you do not need Kubernetes-specific features (CRDs, operators, advanced scheduling), and portability is not a requirement.

Choose EKS when: You need Kubernetes ecosystem tools, multi-cloud portability, advanced scheduling, or your team already knows Kubernetes.

