Cloud Operations Basics Cheat Sheet

One-liner: aws sts get-caller-identity is the cloud equivalent of whoami — run it first in every troubleshooting session to confirm which account, role, and user you are operating as. Most "permission denied" issues start with being in the wrong profile.
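
A minimal sketch of turning that first check into a guard at the top of a script. The function name require_account and the account ID in the usage line are hypothetical; the only real CLI call is aws sts get-caller-identity with its standard --query and --output options.

```shell
# Hypothetical helper: stop unless the active credentials belong to the expected account.
require_account() {
  expected="$1"
  # Ask STS which account the current credentials resolve to.
  actual="$(aws sts get-caller-identity --query Account --output text)" || return 2
  if [ "$actual" != "$expected" ]; then
    echo "wrong account: expected $expected, got $actual" >&2
    return 1
  fi
}
```

Usage, assuming 123456789012 is the account you intend to change: require_account 123456789012 || exit 1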

AWS CLI Essentials

# Identity
aws sts get-caller-identity
aws configure list                     # Show active profile

# EC2
aws ec2 describe-instances --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].[InstanceId,InstanceType,PrivateIpAddress,Tags[?Key==`Name`].Value|[0]]' \
  --output table

# S3
aws s3 ls s3://bucket/prefix/
aws s3 sync ./dist s3://bucket/ --delete
aws s3 presign s3://bucket/file --expires-in 3600

# Logs
aws logs tail /aws/lambda/my-func --follow
aws logs filter-log-events --log-group-name /ecs/app --filter-pattern "ERROR"

GCP gcloud Essentials

gcloud config set project PROJECT_ID
gcloud config list
gcloud auth application-default login

gcloud compute instances list
gcloud container clusters list
gcloud container clusters get-credentials CLUSTER --zone ZONE   # or --region REGION for regional clusters
gcloud iam service-accounts list
gcloud logging read 'severity>=ERROR' --limit=50

IAM Quick Reference

Concept              AWS                  GCP                Azure
User identity        IAM User             Google Account     Entra ID User (formerly Azure AD)
Machine identity     IAM Role             Service Account    Managed Identity
Permission set       IAM Policy           IAM Role           RBAC Role
Permission boundary  Permission Boundary  Org Policy         Azure Policy
Temp credentials     STS AssumeRole       Workload Identity  Managed Identity token

Networking Comparison

Concept             AWS              GCP             Azure
Virtual network     VPC              VPC             VNet
Subnet              Subnet           Subnet          Subnet
Firewall rules      Security Groups  Firewall Rules  NSGs
NAT                 NAT Gateway      Cloud NAT       NAT Gateway
Load balancer (L7)  ALB              HTTP(S) LB      App Gateway
Load balancer (L4)  NLB              TCP/UDP LB      Azure LB
DNS                 Route 53         Cloud DNS       Azure DNS
CDN                 CloudFront       Cloud CDN       Azure CDN

VPC Troubleshooting Flow

Can't connect?
  ├── DNS resolution works?   dig/nslookup
  ├── Security Group allows?  Check inbound rules on target
  ├── NACL allows?            Check both inbound AND outbound
  ├── Route table has route?  Check both subnets
  ├── NAT Gateway (if private → internet)?
  ├── VPC peering / Transit GW route?
  └── Application listening on right port?
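
The two ends of this flow (DNS and the listening port) can be tested from the client before opening the console. The sketch below assumes a Linux host with bash, getent, and coreutils timeout; resolves, port_open, and diagnose are hypothetical helper names, and the 3-second timeout is an arbitrary choice.

```shell
# True if the hostname resolves (getent ships with glibc-based Linux).
resolves() {
  getent hosts "$1" >/dev/null 2>&1
}

# True if a TCP connection to host:port opens within 3 seconds.
# Relies on bash's built-in /dev/tcp redirection, so no extra tools are needed.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Print which layer to look at next.
diagnose() {
  host="$1"; port="$2"
  if ! resolves "$host"; then
    echo "dns: $host does not resolve"
    return 1
  fi
  if ! port_open "$host" "$port"; then
    echo "network-or-app: check SG, NACL, routes, then the listener on port $port"
    return 1
  fi
  echo "ok: $host:$port is reachable"
}
```

As a rough rule, a connection that hangs until the timeout usually means packets are being dropped (security group, NACL, or a missing route), while an immediate "connection refused" usually means the path is fine and nothing is listening on that port.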

Cost Control Checklist

Daily:
  [ ] Check Cost Explorer for anomalies
  [ ] Review any budget alerts

Weekly:
  [ ] Find orphaned resources (unattached EBS, unused EIPs)
  [ ] Check for oversized instances (CPU < 10% avg)
  [ ] Review data transfer charges

Monthly:
  [ ] Right-size instances based on CloudWatch metrics
  [ ] Evaluate Reserved Instance / Savings Plan coverage
  [ ] Review and clean old snapshots and AMIs
  [ ] Check for idle load balancers
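
The weekly orphan sweep can be scripted. This is a sketch with hypothetical function names; it assumes the AWS CLI is configured, and it only covers the region of the active profile.

```shell
# EBS volumes in the "available" state are not attached to any instance.
list_unattached_volumes() {
  aws ec2 describe-volumes \
    --filters "Name=status,Values=available" \
    --query 'Volumes[].[VolumeId,Size,CreateTime]' \
    --output text
}

# Elastic IPs with no AssociationId are allocated but not in use.
list_unused_eips() {
  aws ec2 describe-addresses \
    --query 'Addresses[?AssociationId==`null`].[AllocationId,PublicIp]' \
    --output text
}
```

Both functions only list candidates. Review the output before deleting anything, since an unattached volume may still hold data someone needs.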

Default trap: AWS default limits are surprisingly low for production use. The most common surprises are 5 EIPs and 5 VPCs per region. Lambda's default of 1,000 concurrent executions per region can cause throttling during traffic spikes. Request limit increases before you need them: filing a request takes minutes (approval can take longer), but discovering that a limit is the cause of a failure can take hours of debugging.

Common AWS Resource Limits

Resource                      Default Limit
VPCs per region               5
Subnets per VPC               200
Security Groups               2,500 per region (formerly 500 per VPC)
Rules per SG                  60 inbound + 60 outbound
EIPs per region               5
EC2 instances (on-demand)     Varies by type (vCPU-based, per instance family)
S3 buckets per account        10,000 (100 before November 2024)
Lambda concurrent executions  1,000

Request increases via the Service Quotas console.
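
Quotas can also be read from the CLI and compared with current usage. The sketch below uses hypothetical helper names, and the 80% warning threshold is an arbitrary assumption; the limit is passed in by hand rather than parsed from the quota output.

```shell
# Print every quota name and applied value for one service code (for example vpc, ec2, lambda).
show_quotas() {
  aws service-quotas list-service-quotas \
    --service-code "$1" \
    --query 'Quotas[].[QuotaName,Value]' \
    --output text
}

# Count VPCs in the current region.
count_vpcs() {
  aws ec2 describe-vpcs --query 'length(Vpcs)' --output text
}

# Warn once usage reaches 80% of the limit (integer arithmetic).
quota_headroom() {
  used="$1"; limit="$2"
  pct=$(( used * 100 / limit ))
  if [ "$pct" -ge 80 ]; then
    echo "WARN ${pct}%"
  else
    echo "OK ${pct}%"
  fi
}
```

Usage, assuming the VPC limit is still the default of 5: quota_headroom "$(count_vpcs)" 5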

Gotcha: aws s3 sync --delete mirrors a local directory to S3, deleting remote objects that do not exist locally. This is destructive: run it from an empty directory by accident and it deletes everything under the destination prefix. Always do a dry run first: aws s3 sync ./dist s3://bucket/ --delete --dryrun.
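
One way to make the dry run hard to skip is a small wrapper. This is a sketch: safe_sync is a hypothetical name, and it only ever runs the --dryrun form, leaving the real sync as a deliberate second step.

```shell
# Hypothetical wrapper: refuse an empty or missing source, then preview the sync.
safe_sync() {
  src="$1"; dest="$2"
  # An empty source combined with --delete would remove everything under dest.
  if [ ! -d "$src" ] || [ -z "$(ls -A "$src")" ]; then
    echo "refusing to sync: $src is missing or empty" >&2
    return 1
  fi
  # Preview only: --dryrun prints the planned uploads and deletes without performing them.
  aws s3 sync "$src" "$dest" --delete --dryrun
}
```

Usage: safe_sync ./dist s3://bucket/ and, once the preview looks right, rerun the same aws s3 sync command without --dryrun.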

Profile and Credential Management

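# AWS SSO (recommended)
aws configure sso                      # One-time setup; creates a named profile
aws sso login --profile prod           # Sign in again when the session expires
export AWS_PROFILE=prod
aws sts get-caller-identity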

# AWS per-command profile
aws s3 ls --profile staging

# aws-vault (secure credential storage)
aws-vault exec prod -- terraform plan

# GCP configurations
gcloud config configurations create prod
gcloud config configurations activate prod
gcloud config configurations list

Quick Debugging Commands

# AWS: Check why instance can't reach internet
aws ec2 describe-route-tables --filters "Name=association.subnet-id,Values=subnet-xxx"
aws ec2 describe-nat-gateways --filter "Name=state,Values=available"
aws ec2 describe-security-groups --group-ids sg-xxx

# AWS: Check EKS node status
aws eks describe-cluster --name my-cluster --query 'cluster.status'
aws ec2 describe-instances --filters "Name=tag:eks:cluster-name,Values=my-cluster" \
  --query 'Reservations[].Instances[].[InstanceId,State.Name]'

# GCP: Check GKE node pool
gcloud container node-pools list --cluster my-cluster --zone us-central1-a