Cloud Provider Deep-Dive - Street-Level Ops¶
AWS Quick Diagnosis¶
# Who am I?
aws sts get-caller-identity
# Check EKS cluster status
aws eks describe-cluster --name my-cluster --query 'cluster.status'
# List node groups
aws eks list-nodegroups --cluster-name my-cluster
# Check node group health
aws eks describe-nodegroup --cluster-name my-cluster --nodegroup-name workers \
--query 'nodegroup.{status:status,desiredSize:scalingConfig.desiredSize,health:health}'
# Check subnet IPs
aws ec2 describe-subnets --subnet-ids subnet-abc123 \
--query 'Subnets[].{ID:SubnetId,Available:AvailableIpAddressCount,CIDR:CidrBlock}'
# Check security group rules
aws ec2 describe-security-groups --group-ids sg-abc123 \
--query 'SecurityGroups[].IpPermissions'
# Check NAT Gateway (cost driver)
aws ec2 describe-nat-gateways --query 'NatGateways[].{ID:NatGatewayId,State:State,SubnetId:SubnetId}'
One-liner: Quick "am I in the right account?" sanity check: aws sts get-caller-identity --query 'Account' --output text prints just the account number. Tape this to your monitor if you manage multiple accounts.
Debug clue: If aws CLI commands return ExpiredTokenError, your SSO session has expired. Run aws sso login --profile <profile>. If using IAM roles, check aws sts get-caller-identity: you might be using the wrong role or no role at all.
Gotcha: The AWS CLI resolves credentials in a chain: environment variables -> ~/.aws/credentials -> ECS task role -> EC2 instance profile. If a command works on your laptop but fails in CI, you are probably relying on a credential source that does not exist in the CI environment. Run aws configure list to see exactly which source is active.
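The chain lookup above can be sketched as a tiny shell function. This is a simplified model for illustration (real resolution also considers profiles, SSO sessions, and the metadata endpoints); the function name and arguments are hypothetical:

```shell
# Simplified sketch of the AWS credential chain: the first source
# that "exists" wins. Arguments stand in for the real environment.
cred_source() {
  local env_key=$1 creds_file=$2
  if [ -n "$env_key" ]; then
    echo "environment"             # AWS_ACCESS_KEY_ID is set
  elif [ -f "$creds_file" ]; then
    echo "shared-credentials-file" # ~/.aws/credentials exists
  else
    echo "instance-or-task-role"   # fall through to the metadata service
  fi
}

cred_source "AKIAEXAMPLE" /nonexistent   # -> environment
```

The CI-versus-laptop failure mode drops out directly: with no environment key and no credentials file, the lookup falls through to a metadata service that your CI runner may not have.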
GCP Quick Diagnosis¶
# Who am I?
gcloud auth list
gcloud config get-value project
# Check GKE cluster
gcloud container clusters describe my-cluster --zone us-central1-a
# List node pools
gcloud container node-pools list --cluster=my-cluster --zone=us-central1-a
# Check Workload Identity
gcloud iam service-accounts get-iam-policy grokdevops-sa@PROJECT.iam.gserviceaccount.com
# Check firewall rules
gcloud compute firewall-rules list --format="table(name,direction,allowed,sourceRanges)"
# Check VPC subnets
gcloud compute networks subnets list --network=my-vpc
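To confirm Workload Identity from inside a running pod, you can ask the GKE metadata server which Google service account the pod is acting as. A sketch, reusing this doc's grokdevops namespace and deployment names:

```shell
# Ask the GKE metadata server for the effective Google service account.
# With Workload Identity wired up, this prints the bound GSA email,
# not the node's default compute service account.
kubectl exec -n grokdevops deploy/grokdevops -- \
curl -sH "Metadata-Flavor: Google" \
"http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email"
```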
Gotcha: EKS Subnet Exhaustion¶
Symptoms: New pods stuck in Pending, events show "failed to assign an IP address"
# Check available IPs
aws ec2 describe-subnets --filters "Name=tag:kubernetes.io/cluster/my-cluster,Values=shared" \
--query 'Subnets[].{AZ:AvailabilityZone,Available:AvailableIpAddressCount}'
# Quick fix: add a secondary CIDR
aws ec2 associate-vpc-cidr-block --vpc-id vpc-abc123 --cidr-block 100.64.0.0/16
# Long-term: enable VPC CNI prefix delegation
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
# This assigns /28 prefixes instead of individual IPs (16 IPs per prefix)
Under the hood: Each ENI on an EC2 instance has a limit on IPv4 addresses (varies by instance type). A t3.medium gets 6 IPs per ENI x 3 ENIs, but one IP per ENI is the primary, so the usual VPC CNI formula gives 3 x (6 - 1) + 2 = 17 pods max. With prefix delegation, each free IP slot holds a /28 prefix of 16 addresses (5 x 16 = 80 pod IPs per ENI), so the bottleneck shifts to the kubelet max-pods setting. Check limits with aws ec2 describe-instance-types --instance-types t3.medium --query 'InstanceTypes[].NetworkInfo'.
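The standard (non-prefix-delegation) pod-capacity arithmetic is easy to get wrong under pressure, so here it is as a one-line shell function. The formula is the widely documented VPC CNI one; the instance figures in the examples are for t3.medium and m5.large:

```shell
# max pods = ENIs * (IPs per ENI - 1) + 2
# One IP per ENI is the primary and unusable for pods; the +2 covers
# host-network pods (e.g. kube-proxy, aws-node).
max_pods() {
  local enis=$1 ips_per_eni=$2
  echo $(( enis * (ips_per_eni - 1) + 2 ))
}

max_pods 3 6    # t3.medium: 3 ENIs x 6 IPs -> 17
max_pods 3 10   # m5.large:  3 ENIs x 10 IPs -> 29
```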
Gotcha: IAM Permission Errors¶
Symptoms: Pod logs show "AccessDenied" or "not authorized"
# Check if IRSA is configured
kubectl get sa grokdevops -n grokdevops -o yaml | grep eks.amazonaws.com
# Check if the token is mounted
kubectl exec -n grokdevops deploy/grokdevops -- env | grep AWS
# Test from inside the pod
kubectl exec -it -n grokdevops deploy/grokdevops -- \
aws sts get-caller-identity
# Should show the IRSA role, not the node role
Common cause: Missing OIDC provider condition in the trust policy.
Gotcha: IRSA tokens are projected into the pod as a file at /var/run/secrets/eks.amazonaws.com/serviceaccount/token. If you see the node role instead of the IRSA role in get-caller-identity, the token mount is missing: check that the ServiceAccount has the eks.amazonaws.com/role-arn annotation and the pod spec references that ServiceAccount.
Remember: AWS IAM debugging mnemonic: W-A-T. Who am I (sts get-caller-identity), Am I Allowed (iam simulate-principal-policy), Trust policy correct (check AssumeRolePolicyDocument). Run all three before opening a support ticket.
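The "Am I Allowed" step can be exercised directly without touching the target service. A sketch; the role ARN, account number, and action are placeholders:

```shell
# Simulate whether a principal may perform an action, without calling
# the service itself. Role ARN below is a placeholder.
aws iam simulate-principal-policy \
--policy-source-arn arn:aws:iam::123456789012:role/grokdevops-irsa \
--action-names s3:GetObject \
--query 'EvaluationResults[].{action:EvalActionName,decision:EvalDecision}'
```

A decision of implicitDeny with no matching statements usually means the permission policy is missing the action; explicitDeny means some policy actively blocks it.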
Gotcha: Load Balancer Not Created¶
Symptoms: A Service of type LoadBalancer never gets an address; EXTERNAL-IP stays <pending>
# Check events
kubectl describe svc grokdevops -n grokdevops
# Common causes:
# 1. No AWS LB Controller installed (for ALB)
# 2. Subnet not tagged correctly
# 3. IAM permissions missing for the controller
# 4. Security group quota reached
# Check subnet tags (required for auto-discovery)
aws ec2 describe-subnets --subnet-ids subnet-abc123 \
--query 'Subnets[].Tags[?Key==`kubernetes.io/role/elb`]'
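If the tags are missing, adding them is usually enough for the controller's subnet auto-discovery to succeed on the next reconcile. A sketch; the subnet IDs are placeholders:

```shell
# Public subnets: tag for internet-facing load balancers
aws ec2 create-tags --resources subnet-abc123 \
--tags Key=kubernetes.io/role/elb,Value=1
# Private subnets: tag for internal load balancers
aws ec2 create-tags --resources subnet-def456 \
--tags Key=kubernetes.io/role/internal-elb,Value=1
```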
Pattern: Cost-Aware Node Groups¶
# Production: on-demand, multiple AZs
# Dev/staging: spot instances, single AZ
# Batch: spot instances with diversified types
# EKS managed node group with spot
aws eks create-nodegroup \
--cluster-name my-cluster \
--nodegroup-name spot-workers \
--capacity-type SPOT \
--instance-types m5.large m5.xlarge m6i.large r5.large \
--scaling-config minSize=0,maxSize=10,desiredSize=2
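Once the spot group exists, EKS labels its nodes with eks.amazonaws.com/capacityType=SPOT, which you can use both to verify placement and to steer interruption-tolerant workloads onto spot capacity. A sketch:

```shell
# List only the spot-backed nodes
kubectl get nodes -l eks.amazonaws.com/capacityType=SPOT
# In a batch workload's pod spec, pin it to spot capacity:
#   nodeSelector:
#     eks.amazonaws.com/capacityType: SPOT
```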
Pattern: ECR Image Caching¶
Pulling from ECR through NAT Gateway costs money. Use VPC endpoints:
# Create ECR VPC endpoints (3 needed)
# Interface endpoints for the ECR API and the Docker registry
for svc in ecr.api ecr.dkr; do
aws ec2 create-vpc-endpoint \
--vpc-id vpc-abc123 \
--service-name com.amazonaws.us-east-1.$svc \
--vpc-endpoint-type Interface \
--subnet-ids subnet-abc123 \
--security-group-ids sg-abc123
done
# S3 uses a Gateway endpoint (ECR stores image layers in S3);
# it attaches to route tables, not subnets or security groups
aws ec2 create-vpc-endpoint \
--vpc-id vpc-abc123 \
--service-name com.amazonaws.us-east-1.s3 \
--vpc-endpoint-type Gateway \
--route-table-ids rtb-abc123
Scale note: ECR pull-through cache (aws ecr create-pull-through-cache-rule) lets your cluster cache public images (Docker Hub, Quay, GitHub Container Registry) in your private ECR. This eliminates Docker Hub rate limits and NAT Gateway costs for repeated image pulls across nodes.
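Creating a rule is one call per upstream registry. A sketch for Quay; the repository prefix is your choice:

```shell
# Cache quay.io images under <account>.dkr.ecr.<region>.amazonaws.com/quay/...
aws ecr create-pull-through-cache-rule \
--ecr-repository-prefix quay \
--upstream-registry-url quay.io
```

After this, pulling <account>.dkr.ecr.<region>.amazonaws.com/quay/<image> populates and serves from the cache transparently.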
Emergency: Node Not Joining Cluster¶
# Check node status
kubectl get nodes
# SSH to the node and check kubelet
ssh ec2-user@<node-ip>
journalctl -u kubelet --no-pager --since "10 min ago"
# Common causes:
# 1. Security group blocking 443 to EKS API endpoint
# 2. Node IAM role missing required policies
# 3. DNS resolution failing (check VPC DNS settings)
# 4. Bootstrap token expired
Default trap: EKS API endpoint defaults to public access. If you switch to private-only, nodes in public subnets cannot reach the API server. Either keep a public endpoint with IP allowlisting, or ensure all node subnets have routes to the VPC endpoint.
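You can read the current endpoint configuration straight from the cluster to see which of these failure modes applies. Cluster name follows this doc's examples:

```shell
# Public/private endpoint flags and the allowlisted CIDRs, if any
aws eks describe-cluster --name my-cluster \
--query 'cluster.resourcesVpcConfig.{public:endpointPublicAccess,private:endpointPrivateAccess,allowlist:publicAccessCidrs}'
```

If public is false, every node subnet needs a route to the private endpoint; if public is true with a narrow allowlist, make sure your NAT Gateway's EIP is on that list.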
Quick Reference¶
- Cheatsheet: Cloud-Deep-Dive