Level: L1: Foundations | Topics: AWS Troubleshooting | Domain: Cloud
AWS Troubleshooting - Primer
Why This Matters
AWS is where your infrastructure lives, and when something breaks, the clock is ticking. The difference between a 5-minute fix and a 2-hour outage often comes down to knowing which AWS tool to reach for and which API to query. IAM denials, VPC misconfigurations, and load balancer health check failures account for the majority of AWS-related incidents — and all of them are debuggable if you know where to look.
Core Concepts
1. IAM Debugging
Most AWS "it doesn't work" problems are IAM problems. The error message is almost always Access Denied with no further detail.
Remember: IAM debugging order: "I CRP SV": Identity (who am I?), Compute policies (the identity-based policies on the role or user, attached + inline), Resource policies (bucket/key policies), Permission boundaries, SCPs (org-level), VPC endpoint policies. Work from most specific to broadest scope. Skip a step and you may miss the one deny that blocks everything.
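Debug clue: The single most useful IAM command is aws sts decode-authorization-message. When an API call returns an encoded authorization failure message (only some services, such as EC2, return one), this command decodes it into a JSON document showing whether the denial was explicit or due to a missing allow, which action was attempted, which resource was targeted, and which policy statements matched. The caller needs the sts:DecodeAuthorizationMessage permission to run it. Most engineers never learn this command exists.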
# Check who you are (first step in any IAM debug)
aws sts get-caller-identity
# {
# "Account": "123456789012",
# "Arn": "arn:aws:iam::123456789012:role/deploy-role",
# "UserId": "AROA..."
# }
# Simulate a permission check without actually doing it
aws iam simulate-principal-policy \
--policy-source-arn arn:aws:iam::123456789012:role/deploy-role \
--action-names s3:GetObject \
--resource-arns arn:aws:s3:::my-bucket/config.yaml
# List policies attached to a role
aws iam list-attached-role-policies --role-name deploy-role
aws iam list-role-policies --role-name deploy-role # inline policies
# Get the actual policy document
aws iam get-role-policy --role-name deploy-role --policy-name my-inline-policy
# Check for SCPs (Service Control Policies) blocking access
aws organizations list-policies --filter SERVICE_CONTROL_POLICY
# Decode an encoded authorization failure message
aws sts decode-authorization-message --encoded-message <encoded-message> | jq '.DecodedMessage | fromjson'
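The encoded blob usually arrives buried in a longer CLI error string. The following is a minimal sketch for pulling it out, assuming the error text contains the phrase `Encoded authorization failure message:` followed by the blob (the wording EC2 uses). The function name `extract_encoded_message` is hypothetical.

```shell
# Extract the encoded blob from an error string so it can be passed to
# `aws sts decode-authorization-message`. Prints nothing if no blob is present.
extract_encoded_message() {
  # sed -n with the p flag prints only lines where the substitution matched.
  printf '%s\n' "$1" |
    sed -n 's/.*Encoded authorization failure message: *\([^ ]*\).*/\1/p'
}
```

Captured stderr from a failed call can be passed in as the single argument, and the result handed to `--encoded-message`.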
IAM troubleshooting order:
1. Confirm identity (sts get-caller-identity)
2. Check role/user policies (attached + inline)
3. Check resource policies (S3 bucket policy, KMS key policy)
4. Check permission boundaries
5. Check SCPs (organization-level)
6. Check VPC endpoint policies (if using VPC endpoints)
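One snag between steps 1 and 2: when you are running under an assumed role, `sts get-caller-identity` returns an STS session ARN (`arn:aws:sts::...:assumed-role/<role>/<session>`), while `iam simulate-principal-policy` expects the IAM role ARN. A small sketch of the conversion follows. The function name `caller_arn_to_role_arn` is hypothetical, and it assumes the role has no IAM path, since session ARNs do not carry one.

```shell
# Convert an assumed-role session ARN into the underlying IAM role ARN.
# Any other ARN (IAM user, IAM role) is printed unchanged.
caller_arn_to_role_arn() {
  local arn="$1" account rest role
  case "$arn" in
    arn:aws:sts::*:assumed-role/*)
      account="${arn#arn:aws:sts::}"   # 123456789012:assumed-role/name/session
      account="${account%%:*}"         # 123456789012
      rest="${arn#*:assumed-role/}"    # name/session
      role="${rest%%/*}"               # name
      printf 'arn:aws:iam::%s:role/%s\n' "$account" "$role"
      ;;
    *)
      printf '%s\n' "$arn"
      ;;
  esac
}
```

The output can be used as the `--policy-source-arn` value. If the role was created with a path, look up the full ARN with `aws iam get-role` instead.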
2. VPC Flow Logs
When traffic is not reaching its destination, VPC flow logs show you what is being accepted and rejected:
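# Enable flow logs on a VPC
# Delivery to CloudWatch Logs requires an IAM role; the role name below is a placeholder
aws ec2 create-flow-logs \
--resource-type VPC \
--resource-ids vpc-abc123 \
--traffic-type ALL \
--log-destination-type cloud-watch-logs \
--log-group-name /vpc/flow-logs \
--deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role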
# Query flow logs in CloudWatch Insights
aws logs start-query \
--log-group-name /vpc/flow-logs \
--start-time $(date -d '1 hour ago' +%s) \
--end-time $(date +%s) \
--query-string 'filter action = "REJECT" | stats count() by srcAddr, dstAddr, dstPort'
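Note that `start-query` only returns a query ID; the rows come from a separate `get-query-results` call once the query finishes. A minimal polling sketch, assuming a one-second poll interval is acceptable. The function name `run_insights_query` is hypothetical, and the time window is computed with shell arithmetic so it does not depend on GNU `date -d`.

```shell
# Start a Logs Insights query, wait for it to finish, then print the results.
# Usage: run_insights_query <log-group> <query-string> [hours-back]
run_insights_query() {
  local log_group="$1" query="$2" hours="${3:-1}"
  local end start query_id status
  end="$(date +%s)"
  start=$(( end - hours * 3600 ))

  query_id="$(aws logs start-query \
    --log-group-name "$log_group" \
    --start-time "$start" --end-time "$end" \
    --query-string "$query" \
    --query 'queryId' --output text)" || return 1

  # Poll until the query reaches a terminal state.
  while :; do
    status="$(aws logs get-query-results --query-id "$query_id" \
      --query 'status' --output text)" || return 1
    case "$status" in
      Complete) break ;;
      Failed|Cancelled|Timeout) echo "query ${status}" >&2; return 1 ;;
    esac
    sleep 1
  done

  aws logs get-query-results --query-id "$query_id" --output json
}
```

For example: `run_insights_query /vpc/flow-logs 'filter action = "REJECT" | stats count() by srcAddr, dstAddr, dstPort' 1`.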
Flow log format:
version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status
2 123456789012 eni-abc123 10.0.1.5 10.0.2.10 443 49321 6 15 1500 1610000000 1610000060 ACCEPT OK
2 123456789012 eni-abc123 203.0.113.5 10.0.1.5 54321 22 6 3 180 1610000000 1610000060 REJECT OK
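When the logs are delivered as raw text (for example to S3), the default format can be filtered by field position. A minimal sketch, assuming the default version 2 field order shown above, where `action` is field 13. The function name `rejected_flows` is hypothetical.

```shell
# Read default-format flow log records on stdin and print
# "srcaddr dstaddr dstport" for every rejected flow.
rejected_flows() {
  # Field positions: 4=srcaddr 5=dstaddr 7=dstport 13=action.
  # The first condition skips a header line if one is present.
  awk '$1 != "version" && $13 == "REJECT" { print $4, $5, $7 }'
}
```

Piping the output through `sort | uniq -c | sort -rn` gives a quick ranking of the most frequently rejected flows.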
REJECT at the security group or NACL level means:
- Security group: no matching inbound/outbound rule
- NACL: explicit deny or no matching allow rule
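- Check both: security groups are stateful (return traffic is allowed automatically), NACLs are stateless (return traffic, usually on ephemeral ports, needs its own allow rule)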
Under the hood: VPC flow logs do NOT capture DNS traffic (port 53 to the VPC resolver at .2), DHCP traffic, traffic to the instance metadata service (169.254.169.254), or Windows license activation traffic. If your debugging involves any of these, flow logs will show nothing. This catches people off guard during DNS-related investigations.
3. CloudTrail
Timeline: AWS CloudTrail launched on November 13, 2013. Before CloudTrail, the only way to know who changed what in your AWS account was to grep through application logs or hope someone remembered. CloudTrail made every AWS API call auditable — retroactively answering "who deleted that security group at 3 AM?"
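CloudTrail records management API calls by default; data events (such as S3 object reads and writes) are recorded only if you enable them on a trail. The lookup-events command searches the last 90 days of management events. Essential for "who changed what and when":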
# Look up recent events
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=EventName,AttributeValue=StopInstances \
--max-results 10
# Find who deleted a resource
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=ResourceName,AttributeValue=i-abc123 \
--start-time 2024-01-14T00:00:00Z
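# Check for AccessDenied events (someone tried something they shouldn't)
# The error code lives inside the CloudTrailEvent JSON string, not at the top level of each event
aws cloudtrail lookup-events \
--max-results 50 | jq '.Events[].CloudTrailEvent | fromjson | select((.errorCode // "") | test("AccessDenied|UnauthorizedOperation")) | {eventTime, eventName, user: .userIdentity.arn}'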
Gotcha: CloudTrail is not real time. Expect management events to lag the API call by several minutes (plan for up to 15; AWS cites an average of about 5 minutes with no guarantee), and data events can lag longer. If you are investigating something that happened 2 minutes ago, CloudTrail may not have it yet. For faster visibility, use EventBridge rules that trigger on CloudTrail events.
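An EventBridge rule for this purpose matches on the `AWS API Call via CloudTrail` detail type. A minimal sketch that builds such an event pattern follows. The function name `cloudtrail_event_pattern` and the rule name in the usage note are hypothetical, and it assumes a trail that logs management events already exists in the account.

```shell
# Print an EventBridge event pattern matching one API call from one service.
# Usage: cloudtrail_event_pattern <event-source> <event-name>
cloudtrail_event_pattern() {
  local source="$1" event_name="$2"
  printf '{"source":["%s"],"detail-type":["AWS API Call via CloudTrail"],"detail":{"eventName":["%s"]}}\n' \
    "$source" "$event_name"
}
```

It could be used as `aws events put-rule --name sg-delete-watch --event-pattern "$(cloudtrail_event_pattern aws.ec2 DeleteSecurityGroup)"`. The rule does nothing until a target (an SNS topic, for instance) is attached with `aws events put-targets`.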
4. EC2 Instance Issues
# Check instance status (instance-level + system-level checks)
aws ec2 describe-instance-status --instance-ids i-abc123
# Get system log (serial console output — boot errors, kernel panics)
aws ec2 get-console-output --instance-id i-abc123 --output text
# Check instance metadata from inside the instance
curl -s http://169.254.169.254/latest/meta-data/instance-id
curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/
# Common instance issues and checks
# 1. Instance won't start
aws ec2 describe-instance-status --instance-ids i-abc123 --include-all-instances
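# 2. Cannot SSH (match rules whose port range covers 22, plus all-traffic rules)
aws ec2 describe-security-groups --group-ids sg-abc123 | jq '.SecurityGroups[].IpPermissions[] | select(.IpProtocol == "-1" or (.FromPort != null and .FromPort <= 22 and .ToPort >= 22))'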
# 3. Instance stuck "stopping"
aws ec2 stop-instances --instance-ids i-abc123 --force
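# 4. Disk issues (EBS): check volume size/state and I/O queue depth; filesystem usage itself needs df -h on the instance or the CloudWatch agent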
aws ec2 describe-volumes --filters Name=attachment.instance-id,Values=i-abc123
aws cloudwatch get-metric-statistics \
--namespace AWS/EBS --metric-name VolumeQueueLength \
--dimensions Name=VolumeId,Value=vol-abc123 \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 300 --statistics Average
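The plain `curl` metadata calls above only work when the instance still allows IMDSv1. On instances that require IMDSv2, each request needs a session token first. A minimal sketch, assuming a 60-second token lifetime is enough for a one-off lookup. The function name `imds_get` is hypothetical.

```shell
# Fetch one instance metadata path using an IMDSv2 session token.
# Usage (from inside the instance): imds_get instance-id
imds_get() {
  local path="$1" token
  # Step 1: request a short-lived session token.
  token="$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")" || return 1
  # Step 2: present the token on the metadata request.
  curl -s -H "X-aws-ec2-metadata-token: ${token}" \
    "http://169.254.169.254/latest/meta-data/${path}"
}
```

For example, `imds_get iam/security-credentials/` lists the role name attached to the instance.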
5. ELB Health Check Debugging
# Describe target health (ALB/NLB)
aws elbv2 describe-target-health --target-group-arn arn:aws:elasticloadbalancing:...
# Output:
# "TargetHealth": {
# "State": "unhealthy",
# "Reason": "Target.ResponseCodeMismatch",
# "Description": "Health checks failed with these codes: [503]"
# }
# Check health check configuration
aws elbv2 describe-target-groups --target-group-arns arn:aws:elasticloadbalancing:... \
| jq '.TargetGroups[] | {HealthCheckPath, HealthCheckPort, HealthCheckProtocol, HealthyThresholdCount, UnhealthyThresholdCount, HealthCheckIntervalSeconds}'
# Common health check failures:
# Target.ResponseCodeMismatch → App returning wrong status code
# Target.Timeout → App not responding in time, check SG/NACL
# Target.FailedHealthChecks → Intermittent failures, check app logs
# Elb.InternalError → AWS-side issue, check service health
# Verify from the instance itself
curl -v http://localhost:8080/health
# Does it return 200? Does it respond within the health check timeout?
# Check security group allows health check traffic
aws ec2 describe-security-groups --group-ids sg-abc123 | \
jq '.SecurityGroups[].IpPermissions[] | select(.IpProtocol == "-1" or (.FromPort != null and .FromPort <= 8080 and .ToPort >= 8080))'
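The two questions in the local check above (right status code, fast enough) can be answered in one call. A minimal sketch, assuming an expected code of 200 and a 5-second timeout; substitute the matcher and `HealthCheckTimeoutSeconds` from your own target group. The function name `probe_health` is hypothetical.

```shell
# Probe a health endpoint the way a load balancer would: one request,
# a hard timeout, and a comparison against the expected status code.
# Usage: probe_health <url> [expected-code] [timeout-seconds]
probe_health() {
  local url="$1" expected="${2:-200}" timeout="${3:-5}" code
  # -w '%{http_code}' prints only the status; curl reports 000 when
  # no response arrives before --max-time expires.
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time "$timeout" "$url")"
  if [ "$code" = "$expected" ]; then
    echo "healthy (${code})"
  else
    echo "unhealthy (got ${code}, expected ${expected})"
    return 1
  fi
}
```

Run it on the target itself, for example `probe_health http://localhost:8080/health 200 5`. A result of 000 points at a timeout or refused connection and not at an application error.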
Default trap: ELB health check defaults differ by type. ALB default path is / with a 200 expected response. NLB defaults to TCP health checks (just a connection attempt, no HTTP). If you switch from ALB to NLB and your app relies on HTTP health checks, everything will appear healthy even if the app is returning 500s.
War story: A common production incident: an engineer rotates IAM access keys but forgets to update one service. The service starts throwing Access Denied on S3 calls. The fix is simple, but the investigation takes hours because CloudTrail shows the old key ID being used, and nobody connects "key rotation yesterday" to "S3 failures today." Always search CloudTrail for the old access key ID after rotation to verify nothing is still using it.
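The post-rotation check from the war story can be sketched with two read-only calls. This is an illustration, not a complete audit: `lookup-events` only covers management events, so S3 object reads will not appear there, which is why the IAM last-used record is queried as well. The function name `access_key_activity` is hypothetical.

```shell
# Show where and when an access key was last used, plus recent
# management events signed with it.
# Usage: access_key_activity <access-key-id>
access_key_activity() {
  local key_id="$1"
  # IAM's own record of the most recent use of this key.
  aws iam get-access-key-last-used --access-key-id "$key_id" \
    --query 'AccessKeyLastUsed.[LastUsedDate,ServiceName,Region]' --output text
  # Recent management events made with this key (last 90 days only).
  aws cloudtrail lookup-events \
    --lookup-attributes "AttributeKey=AccessKeyId,AttributeValue=${key_id}" \
    --max-results 20 \
    --query 'Events[].[EventTime,EventName,Username]' --output text
}
```

If either call shows activity after the rotation date, something is still configured with the old key.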
6. S3 Permissions
S3 has multiple layers of access control:
# Check bucket policy
aws s3api get-bucket-policy --bucket my-bucket | jq '.Policy | fromjson'
# Check bucket ACL
aws s3api get-bucket-acl --bucket my-bucket
# Check public access block
aws s3api get-public-access-block --bucket my-bucket
# Test access to a specific object
aws s3api head-object --bucket my-bucket --key config.yaml
# Check if bucket encryption is blocking access
aws s3api get-bucket-encryption --bucket my-bucket
S3 access denied checklist:
1. IAM policy allows the action (s3:GetObject, s3:PutObject, etc.)
2. Bucket policy does not explicitly deny the principal
3. Object ACL allows access (if ACLs are enabled)
4. KMS key policy allows decrypt (if bucket uses KMS encryption)
5. VPC endpoint policy allows the action (if using VPC endpoint)
6. Public access block is not interfering
7. Object ownership settings (BucketOwnerEnforced vs ObjectWriter)
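The `head-object` test is more informative than it looks, because S3 returns 404 for a missing key only when the caller has `s3:ListBucket` on the bucket; without it, a missing key also comes back as 403. A minimal sketch that interprets the CLI error text, assuming the usual `(403)` and `(404)` markers in the message. The function name `explain_head_object_error` is hypothetical.

```shell
# Turn the error text from `aws s3api head-object` into a next step.
# Usage: explain_head_object_error "<captured stderr>"
explain_head_object_error() {
  case "$1" in
    *"(403)"*)
      echo "403: denied, or the key is missing and you lack s3:ListBucket; walk the access checklist"
      ;;
    *"(404)"*)
      echo "404: the key does not exist, and you do have s3:ListBucket; check the key name"
      ;;
    *)
      echo "unrecognized error; read the full message"
      ;;
  esac
}
```

One way to capture the error text is `err="$(aws s3api head-object --bucket my-bucket --key config.yaml 2>&1 >/dev/null)" || explain_head_object_error "$err"`.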
7. Common CLI Patterns
# Find resources across all regions
for region in $(aws ec2 describe-regions --query 'Regions[].RegionName' --output text); do
echo "=== ${region} ==="
aws ec2 describe-instances --region "${region}" --query 'Reservations[].Instances[].{Id:InstanceId,State:State.Name,Type:InstanceType}' --output table
done
# Get all security group rules for an instance
INSTANCE_SGS=$(aws ec2 describe-instances --instance-ids i-abc123 --query 'Reservations[].Instances[].SecurityGroups[].GroupId' --output text)
for sg in ${INSTANCE_SGS}; do
echo "=== ${sg} ==="
aws ec2 describe-security-groups --group-ids "${sg}" --query 'SecurityGroups[].IpPermissions[]' --output table
done
# Watch CloudWatch metrics in terminal
watch -n 60 "aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 --metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-abc123 \
--start-time \$(date -u -d '10 min ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time \$(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 --statistics Average --output text"
# Quick cost check
aws ce get-cost-and-usage \
--time-period Start=$(date -d '-7 days' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
--granularity DAILY --metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE --output table
Interview tip: When asked "how do you troubleshoot an AWS connectivity issue?", the strongest answer walks through layers: 1) IAM: sts get-caller-identity + policy simulation, 2) Network: security groups, NACLs, route tables, flow logs, 3) Service-specific: target health, instance status, S3 access layers, 4) Audit: CloudTrail for recent changes. Showing you have a systematic approach beats jumping straight to "check the security group."
Key Takeaway
AWS troubleshooting follows a pattern: confirm your identity (STS), check permissions (IAM policies + resource policies), check network path (security groups + NACLs + flow logs), and check the service-specific health (target health, instance status, S3 access layers). CloudTrail tells you what changed, flow logs tell you what is blocked, and the CLI is always faster than the console for debugging.