# Cloud Provider Deep-Dive Cheat Sheet
Under the hood: IRSA (AWS) and Workload Identity (GCP) both work by projecting a signed service account token into the pod. The cloud IAM service trusts the cluster's OIDC issuer, verifies the token's audience and subject claims, and issues short-lived cloud credentials. No long-lived keys are stored anywhere; the projected token and the credentials derived from it expire and are refreshed automatically.
## IAM / Identity Federation

### AWS EKS: IRSA (IAM Roles for Service Accounts)

```yaml
# 1. Annotate the ServiceAccount with the role ARN
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123:role/s3-reader
```

```bash
# 2. Create IAM role with OIDC trust
aws iam create-role --role-name s3-reader \
  --assume-role-policy-document file://trust-policy.json

# 3. Attach permissions
aws iam attach-role-policy --role-name s3-reader \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
```
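The `trust-policy.json` referenced above is what binds the role to exactly one ServiceAccount through the cluster's OIDC issuer — it is where the audience and subject checks live. A sketch of its usual shape (the issuer URL, account ID, and namespace are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE:sub": "system:serviceaccount:default:s3-reader",
        "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE:aud": "sts.amazonaws.com"
      }
    }
  }]
}
```

If the `sub` condition doesn't match `system:serviceaccount:<namespace>:<name>` exactly, `AssumeRoleWithWebIdentity` fails — the most common IRSA setup error.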
### GCP GKE: Workload Identity

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: gcs-reader
  annotations:
    iam.gke.io/gcp-service-account: gcs-reader@project.iam.gserviceaccount.com
```

```bash
gcloud iam service-accounts add-iam-policy-binding \
  gcs-reader@project.iam.gserviceaccount.com \
  --member="serviceAccount:project.svc.id.goog[namespace/gcs-reader]" \
  --role="roles/iam.workloadIdentityUser"
```
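The binding above only takes effect for pods that actually run as that Kubernetes ServiceAccount (and the node pool must have Workload Identity enabled). A minimal pod sketch — the pod name, image, and bucket are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gcs-reader-test        # hypothetical test pod
spec:
  serviceAccountName: gcs-reader   # must match the annotated KSA
  containers:
    - name: app
      image: google/cloud-sdk:slim
      command: ["gsutil", "ls", "gs://my-bucket"]   # placeholder bucket
```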
## Load Balancers
| Type | Layer | Use Case | Static IP | Cost |
|---|---|---|---|---|
| ALB (AWS) | 7 | HTTP/HTTPS, path routing | No | $$ |
| NLB (AWS) | 4 | TCP/UDP, gRPC, low latency | Yes | $ |
| CLB (AWS) | 4/7 | Legacy, avoid for new | No | $ |
| Google LB | 7 | HTTP(S), global | Yes (anycast) | $$ |
```yaml
# AWS NLB via Service
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
spec:
  type: LoadBalancer
```
```yaml
# AWS ALB via Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb  # preferred over the kubernetes.io/ingress.class annotation
```
## VPC / Networking
Gotcha: AWS ALB does not support static IP addresses — use NLB if you need a fixed IP (e.g., for DNS A records or firewall whitelisting). ALB IPs change as AWS scales the load balancer. The workaround is to put a Global Accelerator in front of the ALB.
### Subnet IP Exhaustion (EKS)

```bash
# Check available IPs
aws ec2 describe-subnets --subnet-ids subnet-xxx \
  --query 'Subnets[].AvailableIpAddressCount'

# EKS VPC CNI: max pods = ENIs × (IPs per ENI − 1) + 2
# m5.large: 3 × (10 − 1) + 2 = 29 pod IPs

# Fix: enable prefix delegation (a /28 = 16 IPs per address slot instead of 1)
kubectl set env daemonset aws-node -n kube-system \
  ENABLE_PREFIX_DELEGATION=true

# Or: add a secondary CIDR
aws ec2 associate-vpc-cidr-block --vpc-id vpc-xxx \
  --cidr-block 100.64.0.0/16
```
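The pod-density arithmetic can be checked directly. The numbers below use the m5.large figures from above; the prefix-delegation result assumes a /28 (16 IPs) per address slot and ignores the kubelet's max-pods cap, which lands lower in practice:

```shell
# VPC CNI default: max pods = ENIs * (IPs per ENI - 1) + 2
enis=3
ips_per_eni=10                                   # m5.large
max_pods=$(( enis * (ips_per_eni - 1) + 2 ))
echo "default max pods: $max_pods"               # 29

# Prefix delegation: each address slot carries a /28 (16 IPs)
max_pods_pd=$(( enis * (ips_per_eni - 1) * 16 + 2 ))
echo "with prefix delegation: $max_pods_pd"      # 434 (kubelet caps lower in practice)
```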
### Security Groups for Pods (EKS)

```yaml
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: db-access
spec:
  podSelector:
    matchLabels:
      role: backend
  securityGroups:
    groupIds:
      - sg-xxx  # Allow DB access
```
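The policy applies only to pods whose labels match the selector — a pod like this hypothetical one would get a branch ENI with `sg-xxx` attached. Note the feature requires Nitro-based instances and `ENABLE_POD_ENI=true` on the VPC CNI:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: backend-test      # hypothetical pod
  labels:
    role: backend         # matches the SecurityGroupPolicy podSelector
spec:
  containers:
    - name: app
      image: nginx
```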
## Storage
| AWS | GCP | Use Case | IOPS |
|---|---|---|---|
| gp3 | pd-balanced | General purpose | 3,000-16,000 |
| io2 | pd-ssd | Databases | Up to 256,000 |
| st1 | pd-standard | Throughput | Low |
| Instance store | Local SSD | Temp/cache | Very high |
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "5000"
  throughput: "250"
allowVolumeExpansion: true
reclaimPolicy: Retain
```
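A PVC that consumes the class (the claim name and size are illustrative). Because `allowVolumeExpansion` is set, the volume can be grown later by editing `spec.resources.requests.storage`:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data                  # hypothetical claim name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
```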
Default trap: the EKS VPC CNI assigns one IP per pod from the node's subnet. An m5.large supports only 29 pod IPs, so many small pods can exhaust subnet IPs quickly. Enable prefix delegation (`ENABLE_PREFIX_DELEGATION=true`) to get 16 IPs per ENI slot instead of 1, dramatically increasing pod density.
## Node Groups / Instance Strategy
System workloads → On-demand, m5.large, tainted
Web/API (stateless) → Spot, multi-instance-type, multi-AZ
Batch jobs → Spot, scale-to-zero capable
Databases → On-demand, local NVMe (i3/i4)
ML training → Spot GPU (p3/g5), checkpointing
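One way to express the first two rows is an eksctl config — a hedged sketch, with group names and instance types purely illustrative:

```yaml
# Fragment of a hypothetical eksctl ClusterConfig
managedNodeGroups:
  - name: system
    instanceType: m5.large
    taints:
      - key: CriticalAddonsOnly
        value: "true"
        effect: NoSchedule      # keeps general workloads off system nodes
  - name: web
    instanceTypes: [m5.large, m5a.large, m4.large]  # diversify types for Spot capacity
    spot: true
```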
## Cost Quick Reference
| Strategy | Savings | Risk |
|---|---|---|
| Spot instances | 60-90% | Interruption (2 min warning) |
| Savings Plans (1yr) | 30-40% | Commitment |
| Savings Plans (3yr) | 50-60% | Longer commitment |
| Right-sizing | 20-50% | None |
| Scheduled shutdown (dev) | 65% | Availability |
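As a back-of-envelope check on the Spot row — the ~$70/month on-demand price per m5.large and the 80/20 split are assumptions for illustration:

```shell
on_demand=70          # assumed $/month per m5.large on-demand
nodes=10
spot_nodes=8          # keep 2 on-demand for system workloads
spot_discount=70      # percent, mid-range of the 60-90% table row

all_od=$(( nodes * on_demand ))
blended=$(( (nodes - spot_nodes) * on_demand + spot_nodes * on_demand * (100 - spot_discount) / 100 ))
echo "all on-demand: \$$all_od/mo, blended: \$$blended/mo"
```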