
Cloud Operations Basics - Street Ops

What experienced cloud operators know that the console doesn't teach.

Incident Runbooks

IAM Permission Debugging

1. Symptom: "Access Denied" error on an API call

2. Identify the caller:
   # AWS: who am I?
   aws sts get-caller-identity
   # Shows: account, ARN, user/role name

3. Check what permissions the caller has:
   # AWS: simulate the policy
   aws iam simulate-principal-policy \
     --policy-source-arn arn:aws:iam::123456789012:role/my-role \
     --action-names s3:GetObject \
     --resource-arns arn:aws:s3:::my-bucket/path/file.txt

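   # AWS: check policy details (managed policies, then inline policies)
   aws iam list-attached-role-policies --role-name my-role
   aws iam list-role-policies --role-name my-role   # lists inline policy names
   aws iam get-role-policy --role-name my-role --policy-name inline-policy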

4. Common causes:
   a. Missing permission in the IAM policy
      - Check: does the policy allow the specific action?
      - Fix: add the missing action to the policy

   b. Resource ARN mismatch
      - Policy allows s3:GetObject on arn:aws:s3:::my-bucket
      - But you also need arn:aws:s3:::my-bucket/* for objects
      - S3 is notorious for needing both bucket-level and object-level ARNs

> **Default trap:** S3 bucket policies need two separate ARN entries: `arn:aws:s3:::my-bucket` for bucket-level operations (ListBucket) and `arn:aws:s3:::my-bucket/*` for object-level operations (GetObject, PutObject). Missing the `/*` is the single most common IAM permission error in AWS.

   c. Explicit deny overrides allow
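      - An SCP (Service Control Policy), permission boundary, or resource
        policy has an explicit Deny, which overrides any Allow
      - SCPs and permission boundaries also cap permissions: an action they
        do not allow is denied even without an explicit Deny
      - Check SCPs (needs Organizations access, typically from the management account): aws organizations list-policies --filter SERVICE_CONTROL_POLICY
      - Check permission boundaries on the role: aws iam get-role --role-name my-role --query 'Role.PermissionsBoundary'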

   d. Condition not met
      - Policy requires MFA, specific IP, or tag condition
      - Check the Condition block in the policy

   e. Wrong region
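      - IAM itself is global, but many resource ARNs include a region. A policy scoped to us-east-1 resources doesn't help in eu-west-1
      - Check the resource ARN in the policy includes the correct region, and look for aws:RequestedRegion conditions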

   f. Cross-account access
      - Both the source account AND destination account must allow the access
      - Source: IAM policy must allow the action
      - Destination: resource policy (S3 bucket policy, KMS key policy) must trust the source
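The bucket-level versus object-level ARN mismatch in cause (b) can be illustrated with a minimal Python sketch. It assumes IAM resource matching can be approximated by shell-style wildcards via `fnmatchcase`; real IAM matching has more rules (policy variables, conditions), and the helper names here are hypothetical.

```python
from fnmatch import fnmatchcase

def arn_matches(pattern: str, resource_arn: str) -> bool:
    """Simplified stand-in for IAM resource matching ('*' and '?' wildcards)."""
    return fnmatchcase(resource_arn, pattern)

def any_match(patterns, resource_arn):
    """True if at least one Resource entry in the policy covers the ARN."""
    return any(arn_matches(p, resource_arn) for p in patterns)

# Policy that only lists the bucket itself
bucket_only = ["arn:aws:s3:::my-bucket"]
# Policy that lists the bucket and every object in it
bucket_and_objects = ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"]

object_arn = "arn:aws:s3:::my-bucket/path/file.txt"
bucket_arn = "arn:aws:s3:::my-bucket"

print(any_match(bucket_only, object_arn))         # False: GetObject would be denied
print(any_match(bucket_and_objects, object_arn))  # True
print(any_match(["arn:aws:s3:::my-bucket/*"], bucket_arn))  # False: ListBucket would be denied
```

The last line shows the reverse mistake: a policy with only the `/*` entry does not cover bucket-level actions.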

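5. Use CloudTrail to find the exact error:
   # AWS: look up recent calls to a given API (AssumeRole shown here)
   aws cloudtrail lookup-events \
     --lookup-attributes AttributeKey=EventName,AttributeValue=AssumeRole \
     --max-results 10
   # Denied calls carry errorCode/errorMessage fields inside the CloudTrailEvent JSON;
   # the message there often has more detail than the API response
   # Note: lookup-events covers management events only. S3 object-level calls such as
   # GetObject are data events and show up only if a trail is configured to log them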
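As a rough illustration of pulling the error detail out of CloudTrail, the sketch below parses output shaped like `aws cloudtrail lookup-events --output json`, where each event carries its full record as a JSON string in `CloudTrailEvent`. The sample data, the ARNs, and the function name are made up for the example; the substring check on `errorCode` is a simplification.

```python
import json

def summarize_denials(lookup_output):
    """Collect event name, caller ARN, and error message for denied calls."""
    denials = []
    for event in lookup_output.get("Events", []):
        # The detailed record is embedded as a JSON string, not a nested object
        record = json.loads(event["CloudTrailEvent"])
        code = record.get("errorCode", "")
        if "AccessDenied" in code or "Unauthorized" in code:
            denials.append({
                "event": record.get("eventName"),
                "caller": record.get("userIdentity", {}).get("arn"),
                "message": record.get("errorMessage"),
            })
    return denials

# Hypothetical sample standing in for real CLI output
sample = {"Events": [
    {"EventName": "AssumeRole", "CloudTrailEvent": json.dumps({
        "eventName": "AssumeRole",
        "errorCode": "AccessDenied",
        "errorMessage": "hypothetical denial message",
        "userIdentity": {"arn": "arn:aws:sts::111111111111:assumed-role/my-role/session"},
    })},
    {"EventName": "DescribeInstances", "CloudTrailEvent": json.dumps({
        "eventName": "DescribeInstances",
        "userIdentity": {"arn": "arn:aws:sts::111111111111:assumed-role/my-role/session"},
    })},
]}

for denial in summarize_denials(sample):
    print(denial["event"], denial["caller"], denial["message"])
```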

VPC Connectivity Issues

1. Symptom: Can't reach a service from another instance/service

2. Systematic debugging:
   a. Check Security Groups:
      # Source instance SG: does it allow outbound to destination?
      aws ec2 describe-security-groups --group-ids sg-source

      # Destination instance SG: does it allow inbound from source?
      aws ec2 describe-security-groups --group-ids sg-dest

      # Most common issue: destination SG doesn't allow inbound from source
      # Fix: add inbound rule for source SG or source CIDR

   b. Check Route Tables:
      # Can the source subnet route to the destination?
      aws ec2 describe-route-tables --filters "Name=association.subnet-id,Values=subnet-source"
      # Empty result? The subnet has no explicit association and uses the VPC's main route table

      # Is there a route to the destination CIDR?
      # For cross-VPC: is there a VPC peering connection and route?
      # For internet: is there an IGW (public) or NAT GW (private)?

   c. Check NACLs (often forgotten):
      aws ec2 describe-network-acls --filters "Name=association.subnet-id,Values=subnet-source"
      # NACLs are stateless: need both inbound AND outbound rules
      # (return traffic uses ephemeral ports, so those must be allowed too)
      # AND on both source AND destination subnets
      # Evaluated in rule number order, lowest first: first match wins

   d. Check the instance itself:
      # Is the service listening? (SSH in or use SSM)
      ss -tlnp | grep :8080

      # Is the host firewall blocking? (iptables/firewalld)
      iptables -L -n

      # Is the instance in the right subnet?
      aws ec2 describe-instances --instance-ids i-xxx --query 'Reservations[].Instances[].{SubnetId:SubnetId, VpcId:VpcId, PrivateIp:PrivateIpAddress, SGs:SecurityGroups}'
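The "first match wins" behavior of NACLs in step 2c is easy to get wrong when reading a rule list by eye. The following is a minimal sketch of that evaluation, assuming a simplified rule shape (the dict keys are hypothetical and protocol matching is omitted); it is not how AWS implements it, only a model of the ordering.

```python
import ipaddress

def evaluate_nacl(rules, source_ip, port):
    """Return the action of the lowest-numbered rule that matches."""
    addr = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r["rule_number"]):
        in_cidr = addr in ipaddress.ip_network(rule["cidr"])
        in_ports = rule["port_from"] <= port <= rule["port_to"]
        if in_cidr and in_ports:
            return rule["action"]  # first match wins; later rules are ignored
    return "deny"  # the implicit final rule denies anything unmatched

# Listed out of order on purpose: rule 100 is still evaluated before rule 200
inbound = [
    {"rule_number": 200, "cidr": "0.0.0.0/0", "port_from": 8080, "port_to": 8080, "action": "allow"},
    {"rule_number": 100, "cidr": "10.0.1.0/24", "port_from": 0, "port_to": 65535, "action": "deny"},
]

print(evaluate_nacl(inbound, "10.0.1.15", 8080))  # deny  (rule 100 matches first)
print(evaluate_nacl(inbound, "10.0.2.15", 8080))  # allow (only rule 200 matches)
print(evaluate_nacl(inbound, "10.0.2.15", 22))    # deny  (no rule matches)
```

Because NACLs are stateless, the same check would need to be repeated for the outbound rules and for the return traffic's ephemeral port.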

3. Cross-VPC connectivity:
   - VPC Peering: check peering connection is active AND route tables updated in BOTH VPCs
   - Transit Gateway: check TG route tables and associations
   - PrivateLink/VPC Endpoint: check endpoint is in the correct subnet and SG allows traffic

4. DNS resolution:
   # Can the instance resolve the hostname?
   dig myservice.internal
   # If using private hosted zones: is the VPC associated with the zone?
   aws route53 list-hosted-zones-by-vpc --vpc-id vpc-xxx --vpc-region us-east-1

Cost Spike Investigation

1. Detect:
   - Billing alert fires (you set these up on day 1, right?)
   - AWS Cost Explorer shows unexpected increase

2. Identify the cause:
   # AWS CLI: cost by service
   aws ce get-cost-and-usage \
     --time-period Start=$(date -d '-7 days' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
     --granularity DAILY \
     --metrics BlendedCost \
     --group-by Type=DIMENSION,Key=SERVICE

   # Cost Explorer in console: group by Service, then by Usage Type
   # Common culprits:
   # - EC2: instances left running, wrong instance type
   # - NAT Gateway: data processing charges (per GB)
   # - S3: unexpected data transfer
   # - RDS: Multi-AZ instances you didn't intend
   # - CloudWatch: log ingestion and storage
   # - Data transfer: cross-AZ, cross-region, internet egress

3. Common cost traps:

   NAT Gateway data processing:
   - $0.045/GB processed (us-east-1 list price; check current pricing for your region). A service downloading 100GB/day from S3 through
     a NAT Gateway costs $135/month just in NAT fees (100 GB x 30 days x $0.045)
   - Fix: use gateway VPC endpoints for S3 and DynamoDB (free)

   Orphaned resources:
   - EBS volumes from terminated instances: aws ec2 describe-volumes --filters Name=status,Values=available
   - Unattached Elastic IPs: about $3.65/month each ($0.005/hour; since February 2024 in-use public IPv4 addresses are billed at the same rate)
   - Old snapshots: aws ec2 describe-snapshots --owner-ids self | jq '.Snapshots | sort_by(.StartTime) | .[:10]'
   - Unused load balancers
   - Stopped instances with expensive EBS volumes

   Cross-AZ traffic:
   - $0.01/GB between AZs, charged in each direction (so about $0.02/GB when both ends are in your account). High-throughput services talking cross-AZ add up.
   - Fix: use AZ-aware service mesh or deploy talky services in the same AZ
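The arithmetic behind these traps is simple enough to script when estimating a workload. This is a small sketch using the per-GB rates quoted above; treat the rates and the 30-day month as assumptions and check current pricing for your region.

```python
NAT_RATE_PER_GB = 0.045      # NAT Gateway data processing rate quoted above
CROSS_AZ_RATE_PER_GB = 0.01  # cross-AZ transfer rate quoted above

def monthly_cost(gb_per_day, rate_per_gb, days=30):
    """Estimated monthly charge for a steady daily data volume."""
    return gb_per_day * days * rate_per_gb

# 100 GB/day from S3 through a NAT Gateway
print(round(monthly_cost(100, NAT_RATE_PER_GB), 2))       # 135.0

# 500 GB/day of chatter between services in different AZs
print(round(monthly_cost(500, CROSS_AZ_RATE_PER_GB), 2))  # 150.0
```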

4. Quick cleanup:
   # Find unattached EBS volumes
   aws ec2 describe-volumes --filters Name=status,Values=available \
     --query 'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}' --output table

   # Find unused Elastic IPs
   aws ec2 describe-addresses --query 'Addresses[?AssociationId==`null`]'

   # Find old snapshots (older than 90 days; date -d requires GNU date)
   CUTOFF=$(date -d '-90 days' +%Y-%m-%d)
   aws ec2 describe-snapshots --owner-ids self \
     --query "Snapshots[?StartTime<='${CUTOFF}'].{ID:SnapshotId,Size:VolumeSize,Date:StartTime}"

Cross-Account Access

1. When you need cross-account access:
   - Central logging account collects logs from all accounts
   - Deployment pipeline in account A deploys to account B
   - Shared services (monitoring, backups) across accounts

2. How it works (AWS):
   # In account B (trusting account): create a role with a trust policy
   # (the Principal below is account A)
   {
     "Version": "2012-10-17",
     "Statement": [{
       "Effect": "Allow",
       "Principal": {
         "AWS": "arn:aws:iam::111111111111:root"
       },
       },
       "Action": "sts:AssumeRole",
       "Condition": {
         "StringEquals": {
           "sts:ExternalId": "unique-id-for-security"
         }
       }
     }]
   }

   # In account A: assume the role
   aws sts assume-role \
     --role-arn arn:aws:iam::222222222222:role/cross-account-deploy \
     --role-session-name deploy-session \
     --external-id unique-id-for-security

3. Debugging cross-account issues:
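   - Check trust policy: does the role in account B trust account A (or the specific role/user in it)?
   - Check the caller's policy in account A: does it allow sts:AssumeRole on the role's ARN?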
   - Check permissions policy: does the role have the needed actions?
   - Check SCPs: do organization policies allow the action?
   - Check resource policies: does the destination resource (S3, KMS) allow cross-account access?
   - ExternalId must match if specified in the trust policy
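The trust policy and ExternalId checks from this list can be approximated offline against a copy of the role's trust policy. The sketch below is deliberately narrow: it assumes account-root principals and a `StringEquals` condition on `sts:ExternalId` only, and the function name and account IDs are hypothetical. Real trust policies can use role ARNs, services, wildcards, and other conditions that this does not handle.

```python
def trust_policy_allows(policy, caller_account_id, external_id=None):
    """Simplified check: does the trust policy let this account assume the role?"""
    root_arn = f"arn:aws:iam::{caller_account_id}:root"
    for stmt in policy["Statement"]:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRole" not in actions:
            continue
        principal = stmt.get("Principal", {})
        if not isinstance(principal, dict):
            continue  # e.g. "*"; out of scope for this sketch
        principals = principal.get("AWS", [])
        if isinstance(principals, str):
            principals = [principals]
        if root_arn not in principals and caller_account_id not in principals:
            continue
        required = stmt.get("Condition", {}).get("StringEquals", {}).get("sts:ExternalId")
        if required is not None and required != external_id:
            continue  # ExternalId is required but missing or wrong
        return True
    return False

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111111111111:root"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": "unique-id-for-security"}},
    }],
}

print(trust_policy_allows(trust_policy, "111111111111", "unique-id-for-security"))  # True
print(trust_policy_allows(trust_policy, "111111111111"))                            # False
print(trust_policy_allows(trust_policy, "333333333333", "unique-id-for-security"))  # False
```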

Gotchas & War Stories

Remember: IAM evaluation is deny-first: an explicit Deny in any policy beats any Allow. SCPs, permission boundaries, and resource policies can all carry explicit Denies that override your IAM role's Allow. When debugging "Access Denied," check for Deny rules first, because an Allow can never override a Deny.
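The deny-first rule can be modeled in a few lines. The sketch below is a heavy simplification and not the actual IAM algorithm: it flattens every statement from every policy into one list, ignores conditions and principals, and uses case-sensitive wildcard matching. The statement lists and function name are hypothetical.

```python
from fnmatch import fnmatchcase

def evaluate(statements, action, resource):
    """Return 'ExplicitDeny', 'Allow', or 'ImplicitDeny' for a request."""
    matched_allow = False
    for stmt in statements:
        if not any(fnmatchcase(action, a) for a in stmt["Action"]):
            continue
        if not any(fnmatchcase(resource, r) for r in stmt["Resource"]):
            continue
        if stmt["Effect"] == "Deny":
            return "ExplicitDeny"  # a matching Deny ends evaluation immediately
        matched_allow = True
    # With no matching Deny, a matching Allow is needed; otherwise default deny
    return "Allow" if matched_allow else "ImplicitDeny"

statements = [
    # From the role's identity policy
    {"Effect": "Allow", "Action": ["s3:*"], "Resource": ["*"]},
    # From an SCP-style guardrail
    {"Effect": "Deny", "Action": ["s3:DeleteObject"], "Resource": ["*"]},
]

object_arn = "arn:aws:s3:::my-bucket/path/file.txt"
print(evaluate(statements, "s3:GetObject", object_arn))     # Allow
print(evaluate(statements, "s3:DeleteObject", object_arn))  # ExplicitDeny
print(evaluate(statements, "ec2:RunInstances", "*"))        # ImplicitDeny
```

The third case is the quieter failure mode: nothing denies the call, but nothing allows it either, so the result is still "Access Denied".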

The $50K NAT Gateway bill: A team routed all S3 traffic through a NAT Gateway instead of using a VPC endpoint. Millions of log files were written to S3 daily, all passing through the NAT at $0.045/GB. The fix was a one-line Terraform change to add an S3 gateway VPC endpoint, which is free.

The security group that blocked itself: A security group allows inbound from "itself" (common for cluster communication). Someone removes the self-referencing rule while cleaning up "unused" rules, and the entire cluster loses internal communication. Prevention: document why every security group rule exists. Use Terraform with comments.

The public S3 bucket: "Just make it public so the frontend can access it." The result was a data breach with customer data exposed. Prevention: enable S3 Block Public Access at the account level. Use pre-signed URLs or CloudFront with origin access control (OAC, the successor to OAI) for public content.

The AZ outage that wasn't planned for: A single-AZ deployment. When us-east-1a had issues, the entire service went down, while multi-AZ deployments survived. Prevention: always deploy across at least 2 AZs. Test failover by deliberately taking down one AZ.

Terraform state in the wrong account: A team used the same S3 bucket for all environment states. An intern with staging access could read production state (which contains database passwords). Prevention: separate state buckets per environment with strict IAM policies.

One-liner: NAT Gateway charges $0.045/GB processed. VPC Gateway Endpoints for S3 and DynamoDB are free. This single Terraform resource (aws_vpc_endpoint) can save thousands per month on data-heavy workloads.

The missing tag that cost $20K to diagnose: A cost spike alert fires. Nobody can tell which team or project caused it because resources aren't tagged. The team spends two weeks of engineering time investigating. Prevention: enforce tagging with AWS Config rules or SCPs that deny untagged resource creation.
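Before enforcement is in place, an audit of existing resources shows how large the tagging gap is. The sketch below checks entries shaped like the `ResourceTagMappingList` returned by `aws resourcegroupstaggingapi get-resources`; the required tag keys, the sample ARNs, and the function name are assumptions for illustration. Resources that have never been tagged may not appear in that API's output at all, so this is a lower bound.

```python
def missing_required_tags(tag_mappings, required=("Team", "Project")):
    """Map each resource ARN to the required tag keys it lacks."""
    report = {}
    for item in tag_mappings:
        present = {tag["Key"] for tag in item.get("Tags", [])}
        missing = [key for key in required if key not in present]
        if missing:
            report[item["ResourceARN"]] = missing
    return report

# Hypothetical sample standing in for ResourceTagMappingList
sample = [
    {"ResourceARN": "arn:aws:ec2:us-east-1:111111111111:instance/i-0abc123",
     "Tags": [{"Key": "Team", "Value": "payments"},
              {"Key": "Project", "Value": "checkout"}]},
    {"ResourceARN": "arn:aws:ec2:us-east-1:111111111111:volume/vol-0def456",
     "Tags": [{"Key": "Team", "Value": "payments"}]},
    {"ResourceARN": "arn:aws:s3:::my-bucket", "Tags": []},
]

for arn, missing in missing_required_tags(sample).items():
    print(arn, "is missing", ", ".join(missing))
```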

Essential Cloud CLI Commands

# AWS: Identity check
aws sts get-caller-identity

# AWS: List instances
aws ec2 describe-instances --query 'Reservations[].Instances[].{ID:InstanceId, Type:InstanceType, State:State.Name, Name:Tags[?Key==`Name`].Value|[0]}' --output table

# AWS: Check security group rules
aws ec2 describe-security-groups --group-ids sg-xxx --query 'SecurityGroups[].IpPermissions[]'

# AWS: Find resources by tag
aws resourcegroupstaggingapi get-resources --tag-filters Key=Environment,Values=production

# AWS: Cost for last 7 days by service
aws ce get-cost-and-usage --time-period Start=$(date -d '-7 days' +%Y-%m-%d),End=$(date +%Y-%m-%d) --granularity DAILY --metrics BlendedCost --group-by Type=DIMENSION,Key=SERVICE

# AWS: Check CloudTrail for recent activity
aws cloudtrail lookup-events --max-results 20 --query 'Events[].{Time:EventTime, Name:EventName, User:Username}'

# AWS: SSM Session Manager (no SSH key needed)
aws ssm start-session --target i-0abc123

# AWS: List VPC endpoints and their state
aws ec2 describe-vpc-endpoints --query 'VpcEndpoints[].{Service:ServiceName, State:State, VpcId:VpcId}'

Quick Reference