Cloud Deep-Dive Footguns

Advanced cloud mistakes that cause hard-to-debug failures at scale.


1. IRSA/Workload Identity with overly broad role

Your EKS service account has an IAM role with AmazonS3FullAccess. Every pod using that service account can read, write, and delete any S3 bucket in the account. A compromised pod exfiltrates your entire data lake.

Fix: Scope IAM policies to specific resources: arn:aws:s3:::my-specific-bucket/*. Use separate service accounts per workload with minimum permissions.
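A scoped policy might look like the following sketch, reusing the bucket name from above (the exact action list is an assumption — grant only what the workload actually calls):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ObjectAccess",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-specific-bucket/*"
    },
    {
      "Sid": "ListBucket",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-specific-bucket"
    }
  ]
}
```

Note that ListBucket applies to the bucket ARN while object actions apply to the `/*` ARN — mixing these up is itself a common footgun.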


2. VPC CNI running out of ENI IPs

Each EC2 instance has a limited number of ENI slots and IPs per ENI. On a t3.medium, that works out to 17 pods. With 50 pods scheduled, 33 are stuck in Pending because the node can't allocate more IPs.

Fix: Use VPC CNI prefix delegation (assigns /28 prefixes instead of individual IPs). Right-size nodes for pod density. Monitor awscni_assigned_ip_addresses.

Default trap: A t3.medium supports 3 ENIs x 6 IPs each, but each ENI's primary IP is reserved for the node, and 2 host-network pods (aws-node, kube-proxy) don't consume pod IPs. So max pods = 3 x (6 - 1) + 2 = 17. A t3.xlarge gets 4 x (15 - 1) + 2 = 58 pods. With prefix delegation enabled, a t3.medium jumps to 110 pods (the cap for smaller instance types). This is often the difference between "pods stuck in Pending" and "cluster works fine." Check AWS docs for your instance type's ENI limits.
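The max-pods formula above is simple enough to sanity-check in a few lines (a sketch of the standard EKS calculation for the default VPC CNI, without prefix delegation):

```python
def max_pods(enis: int, ips_per_eni: int) -> int:
    """EKS max pods with the default VPC CNI (no prefix delegation).

    Each ENI's primary IP is reserved for the node, and 2 host-network
    pods (aws-node, kube-proxy) don't consume a pod IP.
    """
    return enis * (ips_per_eni - 1) + 2

print(max_pods(3, 6))   # t3.medium  -> 17
print(max_pods(4, 15))  # t3.xlarge  -> 58
```

Look up your instance type's ENI count and IPs-per-ENI in the AWS documentation before trusting any hardcoded numbers.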


3. ALB ingress creating a new LB per Ingress resource

You create separate Ingress resources for each service. The AWS Load Balancer Controller creates a separate ALB for each one. You end up with 20 ALBs at $16/month each, plus data processing fees.

Fix: Use group.name annotation to combine Ingress resources into a single ALB: alb.ingress.kubernetes.io/group.name: shared. Use path-based routing on one ALB.
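Two grouped Ingress resources might look like this sketch (service names and paths are hypothetical; the annotations are the AWS Load Balancer Controller's IngressGroup feature):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: service-a
  annotations:
    alb.ingress.kubernetes.io/group.name: shared   # all Ingresses with this name share one ALB
    alb.ingress.kubernetes.io/group.order: "10"    # rule priority within the shared ALB
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /service-a
            pathType: Prefix
            backend:
              service:
                name: service-a
                port:
                  number: 80
```

A second Ingress for service-b with the same group.name annotation gets merged into the same ALB as an additional listener rule instead of provisioning a new one.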


4. EKS managed node group stuck during upgrade

You upgrade your EKS managed node group. The new launch template has a larger instance type. But your max size equals your desired size. The node group can't create new nodes before draining old ones. Pods have nowhere to go. The upgrade stalls.

Fix: Ensure maxSize > desiredSize (at least +1) to allow surge during upgrades. Or use a blue/green node group strategy.
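In eksctl terms, the surge headroom looks like this sketch (cluster and node group names are hypothetical):

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster        # hypothetical
  region: us-east-1
managedNodeGroups:
  - name: workers
    instanceType: t3.xlarge
    minSize: 3
    desiredCapacity: 3
    maxSize: 4            # desired + 1: leaves room for surge nodes during a rolling upgrade
```

The same constraint applies however you manage the node group — Terraform, CloudFormation, or the console — so check scaling_config wherever it lives.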


5. aws-auth ConfigMap corruption locks everyone out

You edit the aws-auth ConfigMap to add a new IAM role. You make a YAML syntax error. Now no one can authenticate to the cluster — not you, not CI, not the admin. The only way back is through the cluster creator's IAM identity.

Fix: Use eksctl to manage aws-auth instead of editing directly. Keep a backup. Better yet, use EKS access entries (newer feature) which don't have this fragility.

War story: This is one of the most common EKS outages. A single YAML typo in aws-auth locks out all human and CI access. Recovery requires the IAM identity that originally created the cluster — which may be an IAM user that was deleted, a role that was modified, or a CI service account nobody remembers. Always keep a backup: kubectl get configmap aws-auth -n kube-system -o yaml > aws-auth-backup.yaml.
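The access-entries approach mentioned in the fix can be sketched in Terraform (cluster name and role ARN are hypothetical; the policy ARN is AWS's built-in cluster-admin access policy):

```hcl
resource "aws_eks_access_entry" "ops" {
  cluster_name  = "my-cluster"                          # hypothetical
  principal_arn = "arn:aws:iam::111122223333:role/ops"  # hypothetical
}

resource "aws_eks_access_policy_association" "ops_admin" {
  cluster_name  = "my-cluster"
  principal_arn = aws_eks_access_entry.ops.principal_arn
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"

  access_scope {
    type = "cluster"
  }
}
```

Because each entry is a separate API object, a typo in one principal breaks only that principal — not everyone's access at once, as a malformed aws-auth ConfigMap does.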


6. Security group referencing itself causing update deadlock

You create a security group that allows inbound from itself (common for cluster communication). You try to delete or replace it with Terraform. Terraform can't delete the SG because rules reference it, and can't remove the rules without modifying the SG. Deadlock.

Fix: Use separate security group rules (aws_security_group_rule) instead of inline rules. This allows Terraform to modify rules independently. Be aware of this pattern when destroying infrastructure.
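The decoupled pattern looks like this sketch (the VPC variable is assumed):

```hcl
resource "aws_security_group" "cluster" {
  name_prefix = "cluster-"
  vpc_id      = var.vpc_id   # assumed to be defined elsewhere
}

# Standalone rule: the self-reference lives outside the SG resource,
# so Terraform can drop the rule first and then delete the group.
resource "aws_security_group_rule" "cluster_internal" {
  type              = "ingress"
  from_port         = 0
  to_port           = 0
  protocol          = "-1"   # all traffic; from/to ports must be 0 with -1
  self              = true
  security_group_id = aws_security_group.cluster.id
}
```

With inline ingress/egress blocks instead, the self-reference and the group are one resource, and Terraform has no ordering it can use to break the cycle.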


7. Persistent volumes stuck in wrong AZ

Your StorageClass uses volumeBindingMode: Immediate. A PV is created in us-east-1a. The pod is scheduled to us-east-1b. The pod can't mount the volume. It stays Pending forever.

Fix: Use volumeBindingMode: WaitForFirstConsumer. The PV is created in the same AZ as the pod. This is essential for EBS, which is AZ-specific.
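A StorageClass using the EBS CSI driver with delayed binding might look like this (the class name and gp3 type are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer   # provision the volume only after the pod is scheduled
```

With WaitForFirstConsumer, the scheduler picks the node first, and the CSI driver creates the EBS volume in that node's AZ — not the other way around.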


8. CloudWatch log group with no retention

Your application logs are sent to CloudWatch with no retention policy (default: never expire). After a year, you're storing 10TB of logs you'll never look at, costing $300/month.

Fix: Set retention policies on all log groups (7-30 days for most). Export historical logs to S3 with lifecycle policies. Monitor CloudWatch storage costs.
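In Terraform, retention is one attribute on the log group (the name is hypothetical):

```hcl
resource "aws_cloudwatch_log_group" "app" {
  name              = "/app/my-service"  # hypothetical
  retention_in_days = 30                 # default is 0 = never expire
}
```

For log groups created implicitly by Lambda or EKS, it's worth importing them into Terraform (or setting retention via the CLI) so the never-expire default can't silently accumulate.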


9. Auto-scaling group health check mismatch

Your ASG uses EC2 health checks. The instance is running but the application crashed. EC2 health check passes (the instance is up). The ASG doesn't replace the unhealthy instance. Your app is down but your infrastructure thinks everything is fine.

Fix: Use ELB health checks for ASGs behind load balancers. Or use EKS node health checks. EC2 health checks only detect hardware/hypervisor failures.
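A sketch of the relevant ASG attributes in Terraform (the launch template and target group are assumed to exist elsewhere in the config):

```hcl
resource "aws_autoscaling_group" "app" {
  min_size            = 2
  max_size            = 6
  desired_capacity    = 3
  vpc_zone_identifier = var.private_subnet_ids          # assumed variable
  target_group_arns   = [aws_lb_target_group.app.arn]   # assumed resource

  health_check_type         = "ELB"  # replace instances the load balancer marks unhealthy
  health_check_grace_period = 300    # seconds to let the app boot before checks count

  launch_template {
    id      = aws_launch_template.app.id               # assumed resource
    version = "$Latest"
  }
}
```

Set the grace period longer than your app's worst-case startup time, or the ASG will terminate instances that are still booting.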


10. Terraform state in same account as resources

You store Terraform state in the same AWS account it manages. An attacker who compromises the account can modify the state to make Terraform "forget" resources, then create their own without Terraform knowing.

Fix: Store state in a separate, hardened account (the "management" or "security" account). Use cross-account roles for access. Enable versioning and MFA delete on the state bucket.
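A cross-account backend might look like this sketch (bucket, key, and role ARN are hypothetical; the assume_role block and S3-native locking require recent Terraform versions, so check the backend docs for yours):

```hcl
terraform {
  backend "s3" {
    bucket       = "org-terraform-state"   # lives in the hardened management account
    key          = "prod/network.tfstate"
    region       = "us-east-1"
    use_lockfile = true                    # S3-native state locking

    assume_role {
      role_arn = "arn:aws:iam::999988887777:role/terraform-state-access"  # hypothetical cross-account role
    }
  }
}
```

The workload account's credentials never get direct access to the bucket — only the tightly scoped cross-account role does, which is what keeps a workload-account compromise away from the state.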

Gotcha: An attacker who can modify Terraform state can make Terraform "forget" resources exist, effectively hiding their infrastructure changes. With state versioning enabled, you can detect unauthorized modifications by comparing state versions. MFA delete on the S3 bucket prevents state deletion even with compromised credentials.