Portal | Level: L1: Foundations | Domain: DevOps & Tooling

Cloud Operations Basics - Primer

Why This Matters

Cloud is where most infrastructure runs today. Whether it's AWS, GCP, or Azure, the concepts are the same: IAM controls access, networking connects services, compute runs your workloads, and storage persists your data. As a DevOps engineer, you need to understand these primitives, troubleshoot connectivity issues, manage costs, and design secure architectures. Cloud-agnostic understanding transfers across providers.

Core Concepts

IAM (Identity and Access Management)

IAM is the security control plane for everything in the cloud. Every API call is authenticated and authorized through IAM.

Fun fact: AWS IAM launched in 2010, four years after S3 and EC2 (both 2006). Before IAM, every AWS account had a single root credential that the entire team had to share. The introduction of IAM roles for EC2 in 2012 was a turning point: for the first time, EC2 instances could obtain permissions without long-lived access keys baked into the instance.

Key entities:

Entity	What it is	Use case
User	Human identity with credentials	Admin console access
Group	Collection of users	Developers, SREs, Read-only
Role	Set of permissions, assumable	EC2 instances, Lambda, CI/CD
Policy	JSON document defining permissions	Attached to users, groups, or roles
Service Account	Non-human identity (GCP term; Azure uses service principals and managed identities)	Applications, automation

Policy structure (AWS example):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3Read",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ],
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": "10.0.0.0/8"
        }
      }
    }
  ]
}
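To make the evaluation logic concrete, here is a minimal Python sketch of how a policy like the one above could be checked against a request. It is a deliberately simplified model, not AWS's actual evaluation engine: the function names are hypothetical, only the `IpAddress` / `aws:SourceIp` condition is handled, and `Action` and `Resource` are assumed to be lists. It does show the two rules that matter most: everything is denied by default, and an explicit `Deny` beats any `Allow`.

```python
import fnmatch
import ipaddress

POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowS3Read",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
            "Condition": {"IpAddress": {"aws:SourceIp": "10.0.0.0/8"}},
        }
    ],
}


def _matches(patterns, value):
    """True if value matches any pattern. fnmatch gives us '*' wildcards
    (it also understands '?' and '[]', which is close enough for a sketch)."""
    return any(fnmatch.fnmatchcase(value, pattern) for pattern in patterns)


def _condition_ok(condition, source_ip):
    """Only the IpAddress / aws:SourceIp condition is modelled here."""
    cidr = condition.get("IpAddress", {}).get("aws:SourceIp")
    if cidr is None:
        return True
    return ipaddress.ip_address(source_ip) in ipaddress.ip_network(cidr)


def is_allowed(policy, action, resource, source_ip):
    """Default deny; an explicit Deny overrides any matching Allow."""
    allowed = False
    for stmt in policy["Statement"]:
        if not _matches(stmt["Action"], action):
            continue
        if not _matches(stmt["Resource"], resource):
            continue
        if not _condition_ok(stmt.get("Condition", {}), source_ip):
            continue
        if stmt["Effect"] == "Deny":
            return False
        allowed = True
    return allowed


if __name__ == "__main__":
    obj = "arn:aws:s3:::my-bucket/report.csv"
    print(is_allowed(POLICY, "s3:GetObject", obj, "10.1.2.3"))
    print(is_allowed(POLICY, "s3:PutObject", obj, "10.1.2.3"))
    print(is_allowed(POLICY, "s3:GetObject", obj, "192.0.2.10"))
```

The second and third calls fail for different reasons: one asks for an action the statement never grants, the other comes from an address outside the condition's CIDR range. Neither needs a `Deny` statement, because nothing is allowed unless a statement says so.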

Core principles:
- Use roles instead of long-lived access keys wherever possible
- Assign permissions to groups, not individual users
- Start with zero permissions, add as needed
- Use conditions to restrict when policies apply (IP, time, MFA)
- Enable MFA on all human accounts, especially admins
- Review and prune unused permissions regularly

VPC and Networking

Name origin: The term "VPC" was introduced by AWS in 2009. Before VPC, all EC2 instances launched into a single shared network called "EC2-Classic." VPC gave each customer an isolated virtual network with their own IP space, subnets, and routing tables. AWS began retiring EC2-Classic in August 2022 and completed the shutdown in August 2023. GCP uses the term "VPC Network" and Azure uses "Virtual Network (VNet)": different names, same concept.

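A VPC (Virtual Private Cloud) is your isolated network in the cloud. Most compute and database resources (VMs, load balancers, managed databases) live inside a VPC. Some managed services, such as S3 and IAM, sit outside it and are reached through public endpoints or VPC endpoints.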

VPC: 10.0.0.0/16
├── Public Subnet: 10.0.1.0/24  (has route to Internet Gateway)
│   ├── Load Balancer
│   └── NAT Gateway
├── Private Subnet: 10.0.10.0/24 (routes to NAT for outbound only)
│   ├── Application servers
│   └── Worker nodes
└── Private Subnet: 10.0.20.0/24 (no internet route)
    └── Database servers
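Before creating a layout like the one above, it helps to check that every subnet falls inside the VPC range and that no two subnets overlap. The sketch below does that with Python's standard `ipaddress` module. The subnet names and the `check_layout` and `usable_hosts` helpers are hypothetical; the five reserved addresses per subnet are an AWS-specific rule, and other providers reserve different counts.

```python
import ipaddress

VPC = ipaddress.ip_network("10.0.0.0/16")
SUBNETS = {
    "public": ipaddress.ip_network("10.0.1.0/24"),
    "private-app": ipaddress.ip_network("10.0.10.0/24"),
    "private-db": ipaddress.ip_network("10.0.20.0/24"),
}

# AWS reserves the first four addresses and the last one in every subnet.
AWS_RESERVED_PER_SUBNET = 5


def check_layout(vpc, subnets):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    items = list(subnets.items())
    for name, net in items:
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside the VPC range {vpc}")
    for i, (name_a, net_a) in enumerate(items):
        for name_b, net_b in items[i + 1:]:
            if net_a.overlaps(net_b):
                problems.append(f"{name_a} ({net_a}) overlaps {name_b} ({net_b})")
    return problems


def usable_hosts(subnet):
    """Addresses left for your resources after the provider's reservations."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


if __name__ == "__main__":
    print(check_layout(VPC, SUBNETS) or "layout OK")
    for name, net in SUBNETS.items():
        print(f"{name}: {net} -> {usable_hosts(net)} usable addresses")
```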

Key components:
- Subnets: segments of the VPC CIDR, each in one Availability Zone
- Route tables: control where traffic flows
- Internet Gateway (IGW): connects VPC to the internet
- NAT Gateway: allows private subnets to reach the internet (outbound only)
- Security Groups: stateful firewalls on instances (allow rules only)
- Network ACLs (NACLs): stateless firewalls on subnets (allow + deny rules)

Security Groups vs NACLs:

Feature	Security Group	NACL
Level	Instance (ENI)	Subnet
Stateful	Yes (return traffic auto-allowed)	No (must allow both directions)
Rules	Allow only	Allow and Deny
Evaluation	All rules evaluated	Rules evaluated in number order
Default	Deny all inbound, allow all outbound	Allow all
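The "Rules" and "Evaluation" rows are the ones that most often surprise people, so here is a small Python model of just those two behaviours. It is a sketch under simplifying assumptions: rules match on a single port only, and `NaclRule`, `nacl_decision`, and `security_group_allows` are hypothetical names, not part of any cloud SDK. Statefulness is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int   # lower numbers are evaluated first
    action: str   # "allow" or "deny"
    port: int


def nacl_decision(rules, port):
    """NACL: the first matching rule in ascending number order wins.
    If nothing matches, the implicit final rule denies the traffic."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port == port:
            return rule.action
    return "deny"


def security_group_allows(allowed_ports, port):
    """Security group: allow-only, so traffic passes if any rule matches.
    There is no way to express a deny, and rule order is irrelevant."""
    return port in allowed_ports


if __name__ == "__main__":
    nacl = [
        NaclRule(number=200, action="allow", port=22),
        NaclRule(number=100, action="deny", port=22),
        NaclRule(number=300, action="allow", port=443),
    ]
    print(nacl_decision(nacl, 22))    # rule 100 is checked before rule 200
    print(nacl_decision(nacl, 443))
    print(security_group_allows({22, 443}, 22))
```

In the example, the allow rule for port 22 never takes effect because the lower-numbered deny rule matches first. A security group cannot produce that situation, which is one reason it is usually the simpler tool to reason about.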

Compute Primitives

Service	What	When to use
VMs (EC2, GCE, Azure VM)	Full virtual machines	Long-running services, stateful workloads
Containers (ECS, Cloud Run, Azure Container Apps)	Managed container platforms	Microservices, stateless workloads
Serverless (Lambda, Cloud Functions)	Event-driven code execution	API handlers, event processing, cron jobs
Kubernetes (EKS, GKE, AKS)	Managed K8s control plane	Complex container orchestration

Instance types (AWS naming convention):

m5.xlarge
│ │ │
│ │ └── Size (nano, micro, small, medium, large, xlarge, 2xlarge...)
│ └──── Generation (higher = newer, usually better price/performance)
└────── Family (m=general, c=compute, r=memory, t=burstable, g=GPU)
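The naming convention is regular enough to parse mechanically, which is handy when auditing an inventory export. The following sketch splits a type name into its parts. The regular expression is an assumption that covers common names such as `m5.xlarge` or `m6i.2xlarge` (the letters after the generation digit are extra attributes such as processor vendor); it will not handle every special-purpose family. `InstanceType` and `parse_instance_type` are hypothetical helpers.

```python
import re
from typing import NamedTuple

# Family letters taken from the diagram above.
FAMILIES = {"m": "general", "c": "compute", "r": "memory", "t": "burstable", "g": "GPU"}

_PATTERN = re.compile(
    r"^(?P<family>[a-z]+)(?P<generation>\d+)(?P<attributes>[a-z-]*)"
    r"\.(?P<size>[a-z0-9-]+)$"
)


class InstanceType(NamedTuple):
    family: str
    generation: int
    attributes: str  # e.g. "i" in m6i; empty when absent
    size: str


def parse_instance_type(name):
    """Split an instance type name into family, generation, attributes, size."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised instance type: {name!r}")
    return InstanceType(
        family=match["family"],
        generation=int(match["generation"]),
        attributes=match["attributes"],
        size=match["size"],
    )


if __name__ == "__main__":
    for name in ("m5.xlarge", "m6i.2xlarge", "t3.micro"):
        parsed = parse_instance_type(name)
        purpose = FAMILIES.get(parsed.family, "unknown")
        print(f"{name}: {purpose}, generation {parsed.generation}, size {parsed.size}")
```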

Remember: AWS instance family mnemonic: "Most workloads (general), Compute-heavy, RAM-heavy, Tiny/burstable, GPU." Each generation bump (m5 to m6i to m7i) typically brings 10-20% better price/performance, so default to the latest generation available in your region unless you have a compatibility reason not to.

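Gotcha: T-series (burstable) instances earn and spend CPU credits. When credits run out in standard mode, performance drops to the baseline (e.g., 10% of each vCPU for t3.micro). This causes mysterious periodic slowdowns that look like application bugs. T3 and newer families default to unlimited mode, which keeps bursting but bills for the surplus credits, so the symptom there is a higher bill rather than a slowdown. Monitor the CloudWatch metric CPUCreditBalance. For production workloads that need consistent performance, use M-series.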
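A quick calculation shows how fast a credit balance can drain. One CPU credit is one vCPU running at 100% for one minute. The sketch below estimates how long a sustained load lasts before throttling in standard mode. The t3.micro figures used in the example (2 vCPUs, 12 credits earned per hour) are assumptions taken from AWS's published burstable-instance tables and should be checked for your instance type; `hours_until_throttled` is a hypothetical helper.

```python
def hours_until_throttled(balance, earn_per_hour, vcpus, utilization_pct):
    """Hours of sustained load before the credit balance reaches zero.

    One credit = one vCPU at 100% for one minute, so an instance spends
    vcpus * utilization_pct / 100 * 60 credits per hour.
    Returns None when spending does not exceed earnings (no throttling).
    """
    spend_per_hour = vcpus * utilization_pct * 60 / 100
    net_drain = spend_per_hour - earn_per_hour
    if net_drain <= 0:
        return None
    return balance / net_drain


if __name__ == "__main__":
    # Assumed t3.micro figures: 2 vCPUs, 12 credits/hour, 144 credits banked.
    for pct in (10, 25, 50, 100):
        hours = hours_until_throttled(144, 12, 2, pct)
        label = "never" if hours is None else f"{hours:.1f} h"
        print(f"{pct:>3}% sustained on both vCPUs -> throttled after {label}")
```

At exactly the baseline the balance never drains, which is why a service that idles most of the day and spikes briefly fits a burstable instance, while one that sits at 50% all day does not.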

Right-sizing: don't guess instance sizes. Start small, monitor actual CPU/memory usage, adjust. Or use burstable instances (T-series) for variable workloads.

Storage Primitives

Service	Type	Use case
Block storage (EBS, Persistent Disk)	Disk volumes attached to VMs	Databases, OS disks
Object storage (S3, GCS, Blob)	Key-value for files/objects	Backups, static assets, data lakes
File storage (EFS, Filestore)	Shared filesystem (NFS)	Shared configs, CMS content

S3 storage classes (cost optimization):
- Standard: frequently accessed data
- Infrequent Access (IA): less accessed, cheaper storage, retrieval fee
- Glacier: archive, minutes to hours for retrieval
- Lifecycle rules automate transitions between classes

Remember: S3 storage class order by storage price (high to low): Standard → IA → One Zone-IA → Glacier Instant → Glacier Flexible → Glacier Deep Archive. Retrieval fees and delays move in the opposite direction. Mnemonic: "SIOGgd", or "Store It Once, Glacier gets deeper." Deep Archive is the cheapest at ~$1/TB/month, but retrieval takes 12-48 hours.
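Lifecycle rules are usually written as a small configuration document. The sketch below builds one rule as a Python dictionary and checks that each transition moves to a cheaper class than the one before. The dictionary shape follows the S3 lifecycle configuration format as I understand it (`Rules`, `Filter`, `Transitions`, and storage-class identifiers such as `STANDARD_IA`); the prefix, day counts, rule ID, and the `build_lifecycle_rule` helper are hypothetical. Nothing here calls AWS.

```python
import json

# API identifiers ordered from most to least expensive per GB stored.
COST_ORDER = [
    "STANDARD",
    "STANDARD_IA",
    "ONEZONE_IA",
    "GLACIER_IR",
    "GLACIER",
    "DEEP_ARCHIVE",
]


def build_lifecycle_rule(prefix, transitions):
    """Build one lifecycle rule from (days, storage_class) pairs.

    Raises ValueError if a class is unknown or if a later transition
    would move objects to a more expensive (or the same) class.
    """
    ordered = sorted(transitions)
    ranks = [COST_ORDER.index(storage_class) for _, storage_class in ordered]
    if ranks != sorted(ranks) or len(set(ranks)) != len(ranks):
        raise ValueError("transitions must move to cheaper classes over time")
    return {
        "ID": f"tier-down-{prefix.strip('/')}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": days, "StorageClass": storage_class}
            for days, storage_class in ordered
        ],
    }


if __name__ == "__main__":
    rule = build_lifecycle_rule(
        "logs/",
        [(30, "STANDARD_IA"), (90, "GLACIER"), (365, "DEEP_ARCHIVE")],
    )
    print(json.dumps({"Rules": [rule]}, indent=2))
```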

Database Services

Type	AWS	GCP	Azure	When to use
Relational	RDS, Aurora	Cloud SQL	SQL Database	Structured data, transactions
Key-value	DynamoDB	Firestore	Cosmos DB	High throughput, simple queries
Cache	ElastiCache	Memorystore	Cache for Redis	Session data, hot data
Document	DocumentDB	Firestore	Cosmos DB	Flexible schemas

Managed databases handle backups, patching, replication, and failover. Use them unless you have a strong reason to self-manage.

War story: Self-managing databases in the cloud is the most common way teams accidentally burn engineering hours. A managed RDS instance costs more per hour than a raw EC2 instance, but the hidden cost of self-managed is on-call rotation for backups, failover, patching, and storage scaling. Unless your workload requires specific tuning that managed services do not expose, choose managed.

Cost Awareness

Cloud costs are the number one operational surprise for teams moving to the cloud.

Remember: Cost optimization mnemonic: "RRSDT": Right-size, Reserve, Spot, Demote/Delete (move old data to cheaper tiers and remove unused resources), Tag. These strategies address the vast majority of cloud waste. Tagging is the foundation: without tags attributing cost to teams, nobody owns the bill and nobody optimizes.

Cost optimization strategies:
1. Right-size instances: most instances are overprovisioned. Check actual usage.
2. Reserved/committed use: 1- or 3-year commitments typically save 30-60% on compute.
3. Spot/preemptible instances: 60-90% cheaper for fault-tolerant workloads.
4. Storage lifecycle: move old data to cheaper storage tiers automatically.
5. Delete unused resources: orphaned EBS volumes, unused Elastic IPs, and the disks attached to stopped instances still cost money.
6. Tag everything: without tags, you can't attribute costs to teams or projects.
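To see what the percentages above mean in money, a back-of-the-envelope comparison is often enough. The sketch below prices the same fleet three ways. The hourly rate, fleet size, and the specific discounts (40% reserved, 70% spot) are hypothetical values picked from within the ranges quoted above, not real price-list figures.

```python
HOURS_PER_MONTH = 730  # common approximation: 8760 hours / 12 months


def monthly_cost(hourly_rate, instances, discount_pct=0):
    """Monthly cost of a fleet running 24/7 at a given discount."""
    full_price = hourly_rate * instances * HOURS_PER_MONTH
    return round(full_price * (100 - discount_pct) / 100, 2)


if __name__ == "__main__":
    rate = 0.10   # hypothetical on-demand price in USD per instance-hour
    fleet = 10    # hypothetical number of instances

    options = {
        "on-demand": 0,
        "reserved (assumed 40% off)": 40,
        "spot (assumed 70% off)": 70,
    }
    for label, discount in options.items():
        print(f"{label:<28} ${monthly_cost(rate, fleet, discount):>8.2f}/month")
```

The calculation ignores the trade-offs: reserved capacity is paid for whether or not it is used, and spot instances can be reclaimed at short notice, so the discounts only materialise for workloads that fit those constraints.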

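# AWS cost investigation: January 2024 spend, grouped by service
# The End date is exclusive, so use the first day of the following month
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE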
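The command returns JSON, which is easier to act on once it is sorted. The sketch below ranks services by spend from a saved response. The `SAMPLE` document mirrors the response structure as I understand it (`ResultsByTime`, `Groups`, `Keys`, `Metrics`), but the service names and amounts in it are made up for illustration, and `top_services` is a hypothetical helper.

```python
import json
from decimal import Decimal

# Illustrative data only: structure follows the CLI output, numbers are invented.
SAMPLE = """
{
  "ResultsByTime": [
    {
      "TimePeriod": {"Start": "2024-01-01", "End": "2024-02-01"},
      "Groups": [
        {"Keys": ["Amazon Simple Storage Service"],
         "Metrics": {"BlendedCost": {"Amount": "87.10", "Unit": "USD"}}},
        {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
         "Metrics": {"BlendedCost": {"Amount": "1240.50", "Unit": "USD"}}},
        {"Keys": ["Amazon Relational Database Service"],
         "Metrics": {"BlendedCost": {"Amount": "610.25", "Unit": "USD"}}}
      ]
    }
  ]
}
"""


def top_services(response, limit=3):
    """Sum BlendedCost per service across all periods, highest first."""
    totals = {}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            service = group["Keys"][0]
            amount = Decimal(group["Metrics"]["BlendedCost"]["Amount"])
            totals[service] = totals.get(service, Decimal("0")) + amount
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    for service, amount in top_services(json.loads(SAMPLE)):
        print(f"{amount:>10} USD  {service}")
```

Amounts arrive as strings, so the sketch parses them with `Decimal` rather than `float` to avoid rounding drift when summing many line items.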

What Experienced People Know

The cloud console is for reading, not writing. Everything you create should be in Terraform, CloudFormation, or another IaC tool. Console-created resources are untracked and will drift.
Security groups are the most common cause of "it's not working." Before debugging the application, check the security group rules.
Cloud networking is not magic. It follows the same principles as physical networking: subnets, routes, firewalls. The abstraction just hides some complexity.
Cost alerts should be set up on day one. A runaway process, misconfigured autoscaler, or forgotten resource can generate a five-figure bill in days.
Availability Zones are your first layer of redundancy. Always deploy across at least two AZs.
Read the shared responsibility model. The cloud provider secures the infrastructure; you secure your configuration, data, and access.

Prerequisites

Linux Ops (Topic Pack, L0)

Next Steps

AWS CloudWatch (Topic Pack, L2)
AWS EC2 (Topic Pack, L1)
AWS IAM (Topic Pack, L1)
AWS Networking (Topic Pack, L1)
AWS Route 53 (Topic Pack, L2)
AWS S3 Deep Dive (Topic Pack, L1)
Cloud Deep Dive (Topic Pack, L2)
Cloud Ops Drills (Drill, L1)