AWS Troubleshooting

25 cards: 🟢 10 easy | 🟡 14 medium | 🔴 1 hard

🟢 Easy (10)

1. An API call returns AccessDenied. What do you check first?

Answer: 1) Confirm which principal is making the call (aws sts get-caller-identity). 2) Check the identity-based IAM policies for an explicit Deny (an explicit Deny always wins). 3) Check resource-based policies. 4) Check SCPs if the account is in AWS Organizations. 5) Check the permissions boundary, if one is attached.

Remember: enable VPC Flow Logs and CloudTrail before you need them. Troubleshooting without logs is like debugging without stack traces — possible but painful.
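
The "explicit Deny always wins" rule can be sketched in a few lines. This is a simplified model, not the real AWS evaluation engine: it looks at one flat list of statements, ignores conditions, principals, SCPs, and permissions boundaries, and uses case-sensitive glob matching. The function name `evaluate` and the sample statements are hypothetical.

```python
from fnmatch import fnmatchcase


def evaluate(statements, action, resource):
    """Return 'ExplicitDeny', 'Allow', or 'ImplicitDeny' for one request.

    Simplified: every statement is a dict with 'Effect', 'Action' (list),
    and 'Resource' (list). Conditions and principals are not modeled.
    """
    def matches(stmt):
        return (any(fnmatchcase(action, a) for a in stmt["Action"])
                and any(fnmatchcase(resource, r) for r in stmt["Resource"]))

    effects = [s["Effect"] for s in statements if matches(s)]
    if "Deny" in effects:
        return "ExplicitDeny"   # a matching Deny overrides any Allow
    if "Allow" in effects:
        return "Allow"
    return "ImplicitDeny"       # nothing matched, so the request is denied


# Hypothetical statements for illustration only.
STATEMENTS = [
    {"Effect": "Allow", "Action": ["s3:*"], "Resource": ["*"]},
    {"Effect": "Deny", "Action": ["s3:DeleteObject"],
     "Resource": ["arn:aws:s3:::example-bucket/*"]},
]

OBJECT_ARN = "arn:aws:s3:::example-bucket/a.txt"
print(evaluate(STATEMENTS, "s3:GetObject", OBJECT_ARN))
print(evaluate(STATEMENTS, "s3:DeleteObject", OBJECT_ARN))
print(evaluate(STATEMENTS, "ec2:RunInstances", "*"))
```

The three calls illustrate the three outcomes worth distinguishing while debugging: an Allow, an explicit Deny that overrides a broader Allow, and an implicit deny where no statement matched at all.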

2. What is the key difference between security groups and NACLs?

Answer: Security groups are stateful (return traffic is allowed automatically) and support only Allow rules. NACLs are stateless (you must allow both inbound and outbound traffic), evaluate rules in ascending rule-number order with the first match winning, and support Deny rules. NACLs apply at the subnet level; security groups apply at the ENI level.

Remember: SG = stateful, allow-only, attached at the instance (ENI) level. NACL = stateless, allow + deny, evaluated by rule number (lowest first), attached at the subnet level, and return traffic must be allowed explicitly.
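
A minimal sketch of how NACL rule ordering plays out, assuming a simplified rule shape (rule number, CIDR, port range, action) for a single direction. The function name `nacl_decision` and the sample rules are hypothetical, and protocol matching is left out.

```python
import ipaddress


def nacl_decision(rules, src_ip, port):
    """Return 'allow' or 'deny' for one packet against one direction's rules."""
    addr = ipaddress.ip_address(src_ip)
    for rule in sorted(rules, key=lambda r: r["number"]):
        in_cidr = addr in ipaddress.ip_network(rule["cidr"])
        in_ports = rule["from_port"] <= port <= rule["to_port"]
        if in_cidr and in_ports:
            return rule["action"]   # first match wins; later rules are ignored
    return "deny"                   # the implicit final '*' rule


# Hypothetical inbound rules; the list order is irrelevant, only the number matters.
INBOUND = [
    {"number": 200, "cidr": "0.0.0.0/0", "from_port": 443, "to_port": 443,
     "action": "allow"},
    {"number": 100, "cidr": "203.0.113.0/24", "from_port": 0, "to_port": 65535,
     "action": "deny"},
]

print(nacl_decision(INBOUND, "203.0.113.7", 443))   # hits rule 100 first
print(nacl_decision(INBOUND, "198.51.100.9", 443))  # falls through to rule 200
print(nacl_decision(INBOUND, "198.51.100.9", 22))   # no rule matches
```

Because the model is stateless, a real check would run it twice: once on the inbound rules for the request and once on the outbound rules for the reply on an ephemeral port.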

3. An EC2 instance cannot reach the internet. What do you check?

Answer: 1) Is the instance in a public subnet, in a VPC with an Internet Gateway attached? 2) Does the subnet's route table have 0.0.0.0/0 -> IGW? 3) Does the instance have a public IP or Elastic IP? 4) Does the security group allow outbound traffic? 5) Does the NACL allow outbound traffic plus inbound traffic on ephemeral ports? For a private subnet: check that the route table sends 0.0.0.0/0 to a NAT Gateway.
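
The route table check can be sketched as a small classifier over route entries. The entries below use the field names returned by EC2's DescribeRouteTables (`DestinationCidrBlock`, `GatewayId`, `NatGatewayId`); the function name `internet_path` and the resource IDs are hypothetical.

```python
def internet_path(routes):
    """Classify a route table by where its IPv4 default route points."""
    for route in routes:
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        target = route.get("GatewayId") or route.get("NatGatewayId") or ""
        if target.startswith("igw-"):
            return "public"            # default route goes to an Internet Gateway
        if target.startswith("nat-"):
            return "private-via-nat"   # default route goes to a NAT Gateway
        return "other"                 # e.g. transit gateway or an appliance ENI
    return "none"                      # no default route: no internet path


# Hypothetical route tables for illustration.
PUBLIC_RT = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0example"},
]
PRIVATE_RT = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0example"},
]
ISOLATED_RT = [{"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"}]

for name, table in [("public", PUBLIC_RT), ("private", PRIVATE_RT),
                    ("isolated", ISOLATED_RT)]:
    print(name, "->", internet_path(table))
```

A "public" result only covers step 2 of the checklist; the instance still needs a public IP and permissive security group and NACL rules.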

4. DNS resolution works from one VPC but not another. What is the most likely cause?

Answer: The Route 53 private hosted zone is associated with the first VPC but not the second. Fix: associate the hosted zone with the second VPC. For a VPC in another account: authorize the VPC association from the account that owns the zone, or use Route 53 Resolver endpoints.

5. When do you use ALB vs NLB?

Answer: ALB (L7): HTTP/HTTPS routing, path- and host-based rules, WAF integration, gRPC support. NLB (L4): TCP/UDP, very low latency, static IPs, TLS passthrough. Use an NLB for non-HTTP protocols or when you need a fixed IP. Use an ALB for web applications that need content-based routing.

6. An EC2 instance won't start. What do you check?

Answer: 1) Get the system log: aws ec2 get-console-output. 2) Check the status checks: a failed system check points to the host, a failed instance check points to the OS. 3) Verify the AMI exists and has not been deregistered. 4) Check account-level limits (vCPU quota). 5) Check for EBS volume issues (snapshot still restoring, encrypted volume without KMS access).

Gotcha: 'InsufficientInstanceCapacity' = no available hardware for that instance type in the AZ. Try a different AZ or instance type.

7. What is the difference between EBS and instance store?

Answer: EBS persists independently of the instance: it survives stop/start and can be snapshotted and detached. Instance store is ephemeral: data is lost on stop, terminate, or host failure. Use EBS for anything you need to keep, and instance store for scratch data, caches, or temporary processing.

Remember: EBS = Elastic Block Store. Network-attached block storage for EC2. Think 'virtual hard drive.' Types: gp3 (general), io2 (high IOPS), st1 (throughput), sc1 (cold).

8. An S3 GetObject returns 403 Forbidden. What do you check?

Answer: 1) The bucket policy may have an explicit Deny. 2) The caller's IAM policy may lack s3:GetObject. 3) The object may be owned by a different account (check ACLs or set Object Ownership to BucketOwnerEnforced). 4) If using a VPC endpoint, check the endpoint policy. 5) If the object is encrypted with KMS, check the KMS key policy.

Remember: 'When in doubt, check IAM.' Access Denied = missing Allow, explicit Deny, SCP or permissions boundary, or wrong account. Use the IAM Policy Simulator to debug.

9. When is a CloudWatch alarm in INSUFFICIENT_DATA state?

Answer: 1) The metric has no data points in the evaluation period (instance stopped, metric not yet published). 2) The alarm was just created and has not collected enough data. 3) The namespace or metric name is wrong. 4) The dimension (e.g., InstanceId) doesn't match. Check the metric in the console to verify that data exists.

10. You created a resource but can't find it in the console. What is the most common mistake?

Answer: The wrong region is selected in the console. Most AWS resources are regional (exceptions: IAM, Route 53, CloudFront, and the S3 bucket namespace). Always verify that the region in the console dropdown matches where you created the resource. Also check that you are in the correct account if using Organizations.

🟡 Medium (14)

1. How do you debug a complex IAM deny using CloudTrail?

Answer: Find the event in CloudTrail: the errorCode is AccessDenied and the errorMessage often hints at which policy denied the request. Use the IAM Policy Simulator to test the principal + action + resource. Distinguish an implicit deny (no matching Allow) from an explicit Deny.
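
A minimal sketch of pulling denied calls out of a CloudTrail log file, assuming the usual `{"Records": [...]}` layout and the standard record fields (`eventTime`, `eventSource`, `eventName`, `userIdentity.arn`, `errorCode`, `errorMessage`). The sample records, ARNs, and messages are made up for illustration, and the substring match on `errorCode` is an assumption meant to catch variants such as `AccessDeniedException`.

```python
import json

# Hypothetical log content; real files are delivered gzip-compressed to S3.
SAMPLE_LOG = json.dumps({"Records": [
    {"eventTime": "2024-01-15T12:00:00Z", "eventSource": "s3.amazonaws.com",
     "eventName": "GetObject",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:role/app-role"},
     "errorCode": "AccessDenied",
     "errorMessage": "illustrative message naming the denying policy"},
    {"eventTime": "2024-01-15T12:00:05Z", "eventSource": "ec2.amazonaws.com",
     "eventName": "DescribeInstances",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:role/app-role"}},
]})


def access_denied_events(log_text):
    """Return a compact summary of every denied call in one log file."""
    hits = []
    for rec in json.loads(log_text).get("Records", []):
        if "AccessDenied" not in rec.get("errorCode", ""):
            continue
        service = rec.get("eventSource", "").split(".")[0]
        hits.append({
            "time": rec.get("eventTime"),
            "principal": rec.get("userIdentity", {}).get("arn"),
            "action": f'{service}:{rec.get("eventName")}',
            "message": rec.get("errorMessage"),
        })
    return hits


for hit in access_denied_events(SAMPLE_LOG):
    print(hit["time"], hit["principal"], hit["action"], hit["message"])
```

The principal and action from each hit are the inputs you would then feed to the IAM Policy Simulator.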

2. An IAM role works from one account but not another. What is likely wrong?

Answer: Cross-account access requires both: 1) the role's trust policy must allow sts:AssumeRole from the source account, AND 2) the calling principal in the source account must have an IAM policy allowing sts:AssumeRole on the target role ARN. Missing either side causes AccessDenied.
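
The trust-policy half of that check can be sketched as below. This is a rough illustration only: it ignores Condition blocks (such as an ExternalId requirement) and wildcard principals, and the function name `trust_allows_account`, the account IDs, and the sample policy are hypothetical.

```python
def _as_list(value):
    """IAM JSON allows a single string where a list is expected."""
    return [value] if isinstance(value, str) else list(value)


def trust_allows_account(trust_policy, account_id):
    """True if some Allow statement lets the given account call sts:AssumeRole."""
    prefix = f"arn:aws:iam::{account_id}:"
    for stmt in trust_policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        if "sts:AssumeRole" not in _as_list(stmt.get("Action", [])):
            continue
        principal = stmt.get("Principal", {})
        if not isinstance(principal, dict):
            continue  # e.g. Principal: "*" is not modeled in this sketch
        for entry in _as_list(principal.get("AWS", [])):
            # Either the bare account ID or any ARN inside that account.
            if entry == account_id or entry.startswith(prefix):
                return True
    return False


# Hypothetical trust policy on the target role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
        "Action": "sts:AssumeRole",
    }],
}

print(trust_allows_account(TRUST_POLICY, "111122223333"))
print(trust_allows_account(TRUST_POLICY, "444455556666"))
```

A True result here covers only one side; the caller's own identity policy still has to allow sts:AssumeRole on the role ARN.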

3. Traffic works from instance A to B but not B to A on the same subnet. What do you check?

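Answer: 1) The security group on A may not allow inbound traffic from B's SG/IP (replies to A-initiated connections still work because security groups are stateful). 2) B's security group may restrict outbound traffic. 3) A host-level firewall (iptables) on A may be dropping the traffic. NACLs are not evaluated for traffic that stays within a subnet, so they are unlikely to be the cause here. Check VPC Flow Logs to confirm the drops.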
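
Confirming drops in VPC Flow Logs can be sketched as a filter over records in the default (version 2) format, whose space-separated fields are version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, log-status. The function names and the sample lines are hypothetical.

```python
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()


def parse_flow_log(line):
    """Split one default-format flow log line into a dict keyed by field name."""
    return dict(zip(FIELDS, line.split()))


def rejected_flows(lines, src, dst):
    """Return parsed records where traffic from src to dst was rejected."""
    records = (parse_flow_log(line) for line in lines)
    return [r for r in records
            if r.get("action") == "REJECT"
            and r.get("srcaddr") == src and r.get("dstaddr") == dst]


# Made-up sample lines: one accepted flow and one rejected flow.
SAMPLE_LINES = [
    "2 123456789012 eni-0aaa 10.0.1.10 10.0.1.20 49152 8080 6 5 300 "
    "1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0bbb 10.0.1.20 10.0.1.10 49200 8080 6 3 180 "
    "1700000000 1700000060 REJECT OK",
]

for rec in rejected_flows(SAMPLE_LINES, "10.0.1.20", "10.0.1.10"):
    print(rec["interface_id"], rec["srcport"], "->", rec["dstport"], rec["action"])
```

A REJECT entry means a security group or NACL dropped the packet. If the flow shows ACCEPT and the connection still fails, the host firewall on the destination is the next suspect.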

4. Traffic between two VPCs over peering is not working. What do you check?

Answer: 1) The peering connection is Active. 2) Route tables in BOTH VPCs have routes to the peer CIDR pointing at the peering connection. 3) Security groups allow the peer CIDR (referencing a peer SG works only when both VPCs are in the same region). 4) NACLs allow the traffic.

5. An application cannot resolve a private DNS name in a VPC. What do you check?

Answer: 1) enableDnsSupport and enableDnsHostnames are both true on the VPC. 2) The private hosted zone is associated with the VPC. 3) The record exists in the hosted zone. 4) The instance is using the VPC DNS resolver (the VPC CIDR base address plus two, e.g. 10.0.0.2 for 10.0.0.0/16). 5) If cross-VPC: check Route 53 Resolver rules and VPC associations.
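
The resolver address in step 4 can be computed rather than guessed. A minimal sketch, assuming the resolver lives at the base of the VPC's primary IPv4 CIDR plus two and is also reachable at the link-local address 169.254.169.253; the function names and the sample resolv.conf text are hypothetical.

```python
import ipaddress


def vpc_resolver_ip(vpc_cidr):
    """Return the VPC DNS resolver address: network base address plus two."""
    network = ipaddress.ip_network(vpc_cidr)
    return str(network.network_address + 2)


def uses_vpc_resolver(resolv_conf_text, vpc_cidr):
    """True if any nameserver line points at the VPC resolver."""
    expected = {vpc_resolver_ip(vpc_cidr), "169.254.169.253"}
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver" and parts[1] in expected:
            return True
    return False


# Hypothetical resolv.conf contents.
GOOD_CONF = "search ec2.internal\nnameserver 10.0.0.2\n"
BAD_CONF = "nameserver 8.8.8.8\n"

print(vpc_resolver_ip("10.0.0.0/16"))
print(uses_vpc_resolver(GOOD_CONF, "10.0.0.0/16"))
print(uses_vpc_resolver(BAD_CONF, "10.0.0.0/16"))
```

On hosts that run a local stub resolver, resolv.conf may list a loopback address instead, so a False result there calls for checking the stub's upstream configuration rather than concluding the VPC resolver is bypassed.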

6. ALB returns 502 Bad Gateway intermittently. What do you check?

Answer: 1) Target health: is the backend failing health checks? 2) Target response time: is the backend timing out? 3) The backend may be closing connections before the ALB finishes, typically because its keep-alive timeout is shorter than the ALB idle timeout. 4) The security group on the targets must allow traffic from the ALB. 5) Check ALB access logs for backend connection errors.

7. You can't SSH into an EC2 instance that was working yesterday. What do you check?

Answer: 1) The security group still allows port 22 from your IP (check whether your IP changed). 2) Instance status checks are passing. 3) The key pair matches. 4) The disk is not full (check the console output). 5) sshd has not crashed (use EC2 Serial Console or SSM Session Manager to get in). 6) No NACL or host firewall change has been made.

8. An EBS volume shows high latency. What do you check?

Answer: 1) Volume type: gp2 may be out of burst credits (check the BurstBalance metric); consider migrating to gp3. 2) IOPS/throughput limits hit (check VolumeReadOps/VolumeWriteOps). 3) Instance throughput limit: smaller instances have EBS bandwidth caps. 4) Check whether the volume is being snapshotted (the first snapshot is slow).
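
To see why small gp2 volumes run out of burst credits, here is a worked sketch of the arithmetic. It assumes the published gp2 figures: a baseline of 3 IOPS per GiB (minimum 100, maximum 16,000), a burst ceiling of 3,000 IOPS, and a full bucket of 5.4 million I/O credits. Treat those constants as assumptions to verify against current EBS documentation; the function names are hypothetical.

```python
GP2_BURST_IOPS = 3000          # assumed burst ceiling
GP2_FULL_BUCKET = 5_400_000    # assumed I/O credits in a full bucket


def gp2_baseline_iops(size_gib):
    """Baseline IOPS for a gp2 volume of the given size."""
    return min(max(100, 3 * size_gib), 16000)


def gp2_burst_seconds(size_gib, demand_iops=GP2_BURST_IOPS):
    """Seconds a full credit bucket lasts under sustained demand.

    Returns None when demand does not exceed baseline, because the
    bucket never drains in that case.
    """
    baseline = gp2_baseline_iops(size_gib)
    effective = min(demand_iops, GP2_BURST_IOPS)
    if effective <= baseline:
        return None
    # Credits drain at (demand - baseline) per second.
    return GP2_FULL_BUCKET / (effective - baseline)


print(gp2_baseline_iops(100))    # 3 IOPS/GiB * 100 GiB
print(gp2_burst_seconds(100))    # 5,400,000 / (3000 - 300)
print(gp2_burst_seconds(1000))   # baseline already equals the burst ceiling
```

Under these assumptions a 100 GiB volume sustains 3,000 IOPS for 2,000 seconds (about 33 minutes) before falling back to 300 IOPS, which is the latency cliff that a zero BurstBalance signals.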

9. What are common EFS performance gotchas?

Answer: 1) The default bursting mode has low baseline throughput for small file systems. 2) Latency is higher than EBS (it is a network file system). 3) File locking behaves differently from local POSIX. 4) The mount target's security group must allow inbound NFS (TCP 2049) from the clients. 5) Provisioned throughput mode costs more but gives consistent performance.

Remember: EFS = Elastic File System. NFS-compatible, shared across multiple EC2 instances. Auto-scales, pay per GB stored. Think 'shared network drive.'

10. S3 ListObjects returns incomplete results. What is happening?

Answer: S3 paginates results: each request returns at most 1000 objects. You must pass the continuation token (NextContinuationToken for v2, Marker for v1) on the next request to get all objects. Also check whether a prefix filter is too restrictive or whether objects were recently deleted.
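
The pagination loop can be sketched as follows. `list_all_keys` accepts any client that exposes a `list_objects_v2` method with the same keyword arguments and response keys as boto3's S3 client (`Bucket`, `Prefix`, `ContinuationToken`, `Contents`, `IsTruncated`, `NextContinuationToken`). `FakeS3` is a hypothetical stand-in with a tiny page size so the example runs without AWS credentials.

```python
def list_all_keys(s3_client, bucket, prefix=""):
    """Collect every key under a prefix by following continuation tokens."""
    keys = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = s3_client.list_objects_v2(**kwargs)
        # 'Contents' is absent when a page has no matching objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


class FakeS3:
    """Hypothetical in-memory stand-in that paginates like list_objects_v2."""

    def __init__(self, keys, page_size=2):
        self.keys = sorted(keys)
        self.page_size = page_size

    def list_objects_v2(self, Bucket, Prefix="", ContinuationToken=None):
        matching = [k for k in self.keys if k.startswith(Prefix)]
        start = int(ContinuationToken or 0)
        chunk = matching[start:start + self.page_size]
        end = start + len(chunk)
        page = {"Contents": [{"Key": k} for k in chunk],
                "IsTruncated": end < len(matching)}
        if page["IsTruncated"]:
            page["NextContinuationToken"] = str(end)
        return page


fake = FakeS3(["logs/a", "logs/b", "logs/c", "img/x"])
first_page_only = fake.list_objects_v2(Bucket="example-bucket", Prefix="logs/")
print(len(first_page_only["Contents"]), "keys on the first page")
print(list_all_keys(fake, "example-bucket", "logs/"))
```

With a real boto3 client, `s3_client.get_paginator("list_objects_v2")` wraps the same loop and is usually the simpler choice.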

11. Custom CloudWatch metrics are not appearing. What do you check?

Answer: 1) The IAM role has the cloudwatch:PutMetricData permission. 2) The namespace is correct (custom namespaces are case-sensitive). 3) The timestamp is no more than 2 weeks in the past or 2 hours in the future. 4) The agent is running and configured correctly. 5) The agent log shows no errors. 6) A new metric may take a few minutes to appear.
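
The timestamp window in step 3 is easy to check before publishing. A minimal sketch, assuming timezone-aware UTC datetimes; the function name `timestamp_accepted` and the fixed `NOW` value are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def timestamp_accepted(ts, now=None):
    """True if ts is within 2 weeks in the past and 2 hours in the future."""
    now = now or datetime.now(timezone.utc)
    return now - timedelta(weeks=2) <= ts <= now + timedelta(hours=2)


# A fixed reference time keeps the example deterministic.
NOW = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)

print(timestamp_accepted(NOW - timedelta(days=13), NOW))   # inside the window
print(timestamp_accepted(NOW - timedelta(days=15), NOW))   # too old
print(timestamp_accepted(NOW + timedelta(hours=3), NOW))   # too far ahead
```

A host with a badly skewed clock is a common way to land outside this window, so checking time synchronization on the publishing instance is a reasonable follow-up.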

12. EKS pods cannot pull images from ECR. What do you check?

Answer: 1) The node IAM role needs ecr:GetDownloadUrlForLayer, ecr:BatchGetImage, and ecr:GetAuthorizationToken. 2) The ECR repository policy allows the account. 3) If cross-account: add the node role ARN to the repository policy. 4) If there is no NAT/IGW: VPC endpoints for ECR (com.amazonaws.region.ecr.dkr and com.amazonaws.region.ecr.api), plus an S3 gateway endpoint for the image layers.

13. EKS worker nodes are NotReady. What do you check?

Answer: 1) Is the aws-node DaemonSet (VPC CNI) running? Check with kubectl get ds -n kube-system. 2) The subnet has IP addresses available for the node. 3) The node IAM role has AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, and AmazonEC2ContainerRegistryReadOnly. 4) The node security group allows communication with the control plane. 5) Check the kubelet logs on the node.

14. You notice unexpected AWS charges. How do you investigate?

Answer: 1) Cost Explorer: filter by service, region, and usage type. 2) Check for forgotten resources: running instances, unattached EBS volumes, idle NAT Gateways, unused Elastic IPs. 3) Look at data transfer costs (often overlooked). 4) Check for resources in unusual regions. 5) Set up AWS Budgets alerts to catch overruns early.

🔴 Hard (1)

1. EKS pods can reach the internet but not other pods across nodes. What is likely wrong?

Answer: Likely a VPC CNI issue: 1) aws-node pods are not running on all nodes. 2) The security group doesn't allow pod-to-pod traffic (the node security group must allow all traffic from itself). 3) The subnet has no free IPs for secondary ENIs. 4) A NACL is blocking inter-node traffic. Check: kubectl logs -n kube-system aws-node-.
