Portal | Level: L1: Foundations | Domain: DevOps & Tooling
Cloud Operations Basics - Primer
Why This Matters
Cloud is where most infrastructure runs today. Whether it's AWS, GCP, or Azure, the concepts are the same: IAM controls access, networking connects services, compute runs your workloads, and storage persists your data. As a DevOps engineer, you need to understand these primitives, troubleshoot connectivity issues, manage costs, and design secure architectures. Cloud-agnostic understanding transfers across providers.
Core Concepts
IAM (Identity and Access Management)
IAM is the security control plane for everything in the cloud. Every API call is authenticated and authorized through IAM.
Fun fact: AWS IAM launched in 2010, four years after S3 and EC2 (both 2006). Before IAM, every AWS account had a single root credential, shared among the entire team. The introduction of IAM roles for EC2 in 2012 was a turning point: for the first time, EC2 instances could obtain permissions without long-lived access keys baked into the instance.
Key entities:
| Entity | What it is | Use case |
|---|---|---|
| User | Human identity with credentials | Admin console access |
| Group | Collection of users | Developers, SREs, Read-only |
| Role | Set of permissions, assumable | EC2 instances, Lambda, CI/CD |
| Policy | JSON document defining permissions | Attached to users, groups, or roles |
| Service Account | Non-human identity (GCP term; Azure uses service principals and managed identities) | Applications, automation |
Policy structure (AWS example):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3Read",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ],
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": "10.0.0.0/8"
        }
      }
    }
  ]
}
Core principles:
- Use roles instead of long-lived access keys wherever possible
- Assign permissions to groups, not individual users
- Start with zero permissions, add as needed
- Use conditions to restrict when policies apply (IP, time, MFA)
- Enable MFA on all human accounts, especially admins
- Review and prune unused permissions regularly
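To make the policy structure above concrete, here is a minimal Python sketch of how a single Allow statement could be matched against a request. It is a simplified model, not AWS's actual evaluation logic: it ignores explicit denies, multiple policies, case-insensitive action matching, and every condition operator except `IpAddress`. The function name `statement_allows` is hypothetical.

```python
import fnmatch
import ipaddress

# The example statement from above, as a Python dict.
STATEMENT = {
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
    "Condition": {"IpAddress": {"aws:SourceIp": "10.0.0.0/8"}},
}


def statement_allows(statement, action, resource, source_ip):
    """Return True if this one Allow statement matches the request.

    Hypothetical helper: anything it does not match is treated as
    implicitly denied, mirroring "start with zero permissions".
    """
    if statement["Effect"] != "Allow":
        return False
    # Wildcards such as "*" in actions and resources are matched glob-style.
    action_ok = any(fnmatch.fnmatchcase(action, p) for p in statement["Action"])
    resource_ok = any(fnmatch.fnmatchcase(resource, p) for p in statement["Resource"])
    # Only the IpAddress condition is modeled here.
    cidr = statement.get("Condition", {}).get("IpAddress", {}).get("aws:SourceIp")
    ip_ok = cidr is None or ipaddress.ip_address(source_ip) in ipaddress.ip_network(cidr)
    return action_ok and resource_ok and ip_ok


obj = "arn:aws:s3:::my-bucket/logs/a.txt"
print(statement_allows(STATEMENT, "s3:GetObject", obj, "10.1.2.3"))     # allowed action, resource, and IP
print(statement_allows(STATEMENT, "s3:PutObject", obj, "10.1.2.3"))     # action not listed
print(statement_allows(STATEMENT, "s3:GetObject", obj, "192.168.1.1"))  # IP outside the condition
```

All three parts (action, resource, condition) must match for the statement to apply, which is why narrowing any one of them tightens the policy.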
VPC and Networking
Name origin: The term "VPC" was introduced by AWS in 2009. Before VPC, all EC2 instances launched into a single shared network called "EC2-Classic." VPC gave each customer an isolated virtual network with their own IP space, subnets, and routing tables. EC2-Classic retirement began in August 2022 and was completed in August 2023. GCP uses the term "VPC Network" and Azure uses "Virtual Network (VNet)": different names, same concept.
A VPC (Virtual Private Cloud) is your isolated network in the cloud. Most compute and database resources (VMs, load balancers, managed databases) are launched into a VPC; some services, such as S3 and IAM, sit outside it and are reached through public or private endpoints.
VPC: 10.0.0.0/16
├── Public Subnet: 10.0.1.0/24 (has route to Internet Gateway)
│   ├── Load Balancer
│   └── NAT Gateway
├── Private Subnet: 10.0.10.0/24 (routes to NAT for outbound only)
│   ├── Application servers
│   └── Worker nodes
└── Private Subnet: 10.0.20.0/24 (no internet route)
    └── Database servers
Key components:
- Subnets: segments of the VPC CIDR, each in one Availability Zone
- Route tables: control where traffic flows
- Internet Gateway (IGW): connects VPC to the internet
- NAT Gateway: allows private subnets to reach the internet (outbound only)
- Security Groups: stateful firewalls on instances (allow rules only)
- Network ACLs (NACLs): stateless firewalls on subnets (allow + deny rules)
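The subnet layout above can be checked with Python's standard `ipaddress` module. This is a small sketch using the CIDRs from the diagram; the subnet labels (`public`, `private-app`, `private-db`) and the helper `subnet_for` are hypothetical names chosen for illustration. The usable-address figure assumes AWS's rule of reserving five addresses in every subnet.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")

# CIDRs from the diagram; the labels are illustrative only.
subnets = {
    "public": ipaddress.ip_network("10.0.1.0/24"),
    "private-app": ipaddress.ip_network("10.0.10.0/24"),
    "private-db": ipaddress.ip_network("10.0.20.0/24"),
}

# Every subnet must be carved out of the VPC range.
all_inside = all(net.subnet_of(vpc) for net in subnets.values())

# Subnets within one VPC must not overlap each other.
overlapping = [
    (a, b)
    for a in subnets
    for b in subnets
    if a < b and subnets[a].overlaps(subnets[b])
]

# AWS reserves 5 addresses per subnet, so a /24 offers 251 usable ones.
usable = {name: net.num_addresses - 5 for name, net in subnets.items()}


def subnet_for(ip):
    """Return the label of the subnet containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    for name, net in subnets.items():
        if addr in net:
            return name
    return None


print(all_inside, overlapping, usable["public"], subnet_for("10.0.10.25"))
```

A check like this is handy when planning CIDR ranges before writing them into IaC, since overlapping or out-of-range subnets are rejected at creation time.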
Security Groups vs NACLs:
| Feature | Security Group | NACL |
|---|---|---|
| Level | Instance (ENI) | Subnet |
| Stateful | Yes (return traffic auto-allowed) | No (must allow both directions) |
| Rules | Allow only | Allow and Deny |
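| Evaluation | All rules evaluated; any matching allow permits traffic | Rules evaluated in number order; first match wins |
| Default | Deny all inbound, allow all outbound | Default NACL allows all; a custom NACL denies all until rules are added |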
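The difference in rule evaluation is easier to see in code. The sketch below is a deliberately reduced model that matches on port only (real rules also match protocol, CIDR, and direction); `NaclRule`, `nacl_decision`, and `security_group_allows` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class NaclRule:
    number: int   # lower numbers are evaluated first
    port: int
    action: str   # "allow" or "deny"


def nacl_decision(rules, port):
    """Evaluate NACL rules in ascending number order; the first match wins."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port == port:
            return rule.action
    return "deny"  # the implicit final rule denies anything unmatched


def security_group_allows(allowed_ports, port):
    """Security groups hold only allow rules, so any match permits traffic."""
    return port in allowed_ports


rules = [
    NaclRule(200, 22, "allow"),
    NaclRule(100, 22, "deny"),   # lower number, so it shadows rule 200
    NaclRule(300, 443, "allow"),
]

print(nacl_decision(rules, 22))              # decided by rule 100
print(nacl_decision(rules, 443))             # decided by rule 300
print(security_group_allows({22, 443}, 22))  # any allow rule is enough
```

Because NACLs are stateless, a real configuration would also need a matching rule for return traffic in the opposite direction; the security group does not.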
Compute Primitives
| Service | What | When to use |
|---|---|---|
| VMs (EC2, GCE, Azure VM) | Full virtual machines | Long-running services, stateful workloads |
| Containers (ECS, GKE, AKS) | Managed container orchestration | Microservices, stateless workloads |
| Serverless (Lambda, Cloud Functions) | Event-driven code execution | API handlers, event processing, cron jobs |
| Kubernetes (EKS, GKE, AKS) | Managed K8s control plane | Complex container orchestration |
Instance types (AWS naming convention):
m5.xlarge
││ │
││ └── Size (nano, micro, small, medium, large, xlarge, 2xlarge...)
│└──── Generation (higher = newer, usually better price/performance)
└───── Family (m=general, c=compute, r=memory, t=burstable, g=GPU)
Remember: AWS instance family mnemonic: "Most workloads (general), Compute-heavy, RAM-heavy, Tiny/burstable, GPU." Each generation bump (m5 to m6i to m7i) typically brings 10-20% better price/performance, so prefer the latest generation unless regional availability or a compatibility constraint says otherwise.
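As a rough illustration of the naming convention, the following sketch splits an instance type string into its parts. It assumes the common `family + generation + optional letters + "." + size` shape and will reject names that do not follow it; the family lookup covers only the five letters listed above. `parse_instance_type` is a hypothetical helper, not part of any AWS SDK.

```python
import re

# Only the families named in this primer; others map to "unknown".
FAMILIES = {"m": "general", "c": "compute", "r": "memory", "t": "burstable", "g": "GPU"}

# family letters, generation digits, optional capability letters, ".", size
PATTERN = re.compile(
    r"^(?P<family>[a-z]+)(?P<generation>\d+)(?P<attributes>[a-z]*)\.(?P<size>[a-z0-9]+)$"
)


def parse_instance_type(name):
    """Split a name such as 'm6i.2xlarge' into its components."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type: {name!r}")
    family = match.group("family")
    return {
        "family": family,
        "purpose": FAMILIES.get(family, "unknown"),
        "generation": int(match.group("generation")),
        "attributes": match.group("attributes"),  # e.g. the "i" in m6i
        "size": match.group("size"),
    }


print(parse_instance_type("m5.xlarge"))
print(parse_instance_type("m6i.2xlarge"))
```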
Gotcha: T-series (burstable) instances have CPU credits. When credits run out in standard mode, performance drops to the baseline (e.g., 10% per vCPU for t3.micro). This causes mysterious periodic slowdowns that look like application bugs. T3 instances default to unlimited mode, where sustained bursting is billed as an extra charge instead of being throttled. Use the CloudWatch metric CPUCreditBalance to monitor. For production workloads that need consistent performance, use M-series.
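The credit mechanics can be sketched as simple arithmetic. One CPU credit equals one vCPU running at 100% for one minute. The example below assumes t3.micro-like figures (2 vCPUs, 12 credits earned per hour, a maximum balance of 288) and standard mode; treat those numbers as assumptions to verify against current AWS documentation. `simulate_credit_balance` is a hypothetical helper.

```python
def simulate_credit_balance(start_balance, earn_per_hour, max_balance, vcpus, hourly_utilization):
    """Track a burstable instance's CPU credit balance hour by hour.

    One credit = one vCPU at 100% for one minute, so an hour at
    utilization u across n vCPUs spends u * n * 60 credits.
    Assumes standard mode: the balance never goes below zero.
    """
    balance = float(start_balance)
    history = []
    for utilization in hourly_utilization:
        spent = utilization * vcpus * 60
        balance = min(max_balance, max(0.0, balance + earn_per_hour - spent))
        history.append(round(balance, 1))
    return history


# Two hours at 50% per vCPU, then two hours at the 10% baseline.
history = simulate_credit_balance(
    start_balance=60,
    earn_per_hour=12,
    max_balance=288,
    vcpus=2,
    hourly_utilization=[0.5, 0.5, 0.1, 0.1],
)
print(history)
```

In this run the balance is exhausted during the second hour. At exactly the baseline, earning and spending cancel out (0.1 × 2 × 60 = 12 credits per hour), so the instance never rebuilds a balance unless utilization drops below baseline.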
Right-sizing: don't guess instance sizes. Start small, monitor actual CPU/memory usage, adjust. Or use burstable instances (T-series) for variable workloads.
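One way to turn "monitor, then adjust" into a repeatable check is to look at a high percentile of observed utilization rather than the average. The sketch below is only an illustration: the 20% and 80% thresholds are hypothetical, it considers CPU alone (memory matters just as much), and `rightsizing_hint` is not an existing tool.

```python
import statistics


def rightsizing_hint(cpu_percent_samples, low=20.0, high=80.0):
    """Suggest a sizing direction from 95th-percentile CPU utilization.

    The low/high thresholds are illustrative assumptions, not a standard.
    """
    # quantiles(..., n=20) returns 19 cut points; the last is the 95th percentile.
    p95 = statistics.quantiles(cpu_percent_samples, n=20, method="inclusive")[-1]
    if p95 < low:
        return "downsize", p95
    if p95 > high:
        return "upsize", p95
    return "keep", p95


print(rightsizing_hint([5, 8, 10, 12, 15, 9, 7, 11, 6, 14]))
print(rightsizing_hint([40, 50, 60, 55, 45]))
print(rightsizing_hint([85, 90, 95, 88, 92, 97]))
```

Using the 95th percentile keeps a single short spike from forcing an upsize while still reflecting sustained peaks.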
Storage Primitives
| Service | Type | Use case |
|---|---|---|
| Block storage (EBS, Persistent Disk) | Disk volumes attached to VMs | Databases, OS disks |
| Object storage (S3, GCS, Blob) | Key-value for files/objects | Backups, static assets, data lakes |
| File storage (EFS, Filestore) | Shared filesystem (NFS) | Shared configs, CMS content |
S3 storage classes (cost optimization):
- Standard: frequently accessed data
- Infrequent Access (IA): less accessed, cheaper storage, retrieval fee
- Glacier: archive, minutes to hours for retrieval
- Lifecycle rules automate transitions between classes
Remember: S3 storage classes ordered by storage price (high to low): Standard → IA → One Zone-IA → Glacier Instant → Glacier Flexible → Glacier Deep Archive. Mnemonic: "SIOGgd", or "Store It Once, Glacier gets deeper." Deep Archive is the cheapest to store at ~$1/TB/month, but retrieval gets slower and more expensive as storage price falls: Deep Archive retrieval takes up to 12 hours (standard) or 48 hours (bulk).
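A lifecycle rule is essentially a mapping from object age to storage class. The sketch below models that mapping; the age thresholds (30, 90, and 365 days) are hypothetical and should be chosen from real access patterns, while the class names follow the S3 API values `STANDARD`, `STANDARD_IA`, `GLACIER`, and `DEEP_ARCHIVE`. The helper `storage_class_for_age` is an illustration, not an AWS function.

```python
# Hypothetical transition thresholds in days, checked from oldest to newest.
TRANSITIONS = [
    (365, "DEEP_ARCHIVE"),
    (90, "GLACIER"),
    (30, "STANDARD_IA"),
]


def storage_class_for_age(age_days):
    """Return the storage class an object of this age would sit in."""
    for min_age, storage_class in TRANSITIONS:
        if age_days >= min_age:
            return storage_class
    return "STANDARD"


for age in (10, 30, 120, 400):
    print(age, storage_class_for_age(age))
```

In practice the same thresholds would be expressed declaratively in a bucket lifecycle configuration so that S3 performs the transitions automatically.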
Database Services
| Type | AWS | GCP | Azure | When to use |
|---|---|---|---|---|
| Relational | RDS, Aurora | Cloud SQL | SQL Database | Structured data, transactions |
| Key-value | DynamoDB | Firestore | Cosmos DB | High throughput, simple queries |
| Cache | ElastiCache | Memorystore | Cache for Redis | Session data, hot data |
| Document | DocumentDB | Firestore | Cosmos DB | Flexible schemas |
Managed databases handle backups, patching, replication, and failover. Use them unless you have a strong reason to self-manage.
War story: Self-managing databases in the cloud is one of the most common ways teams accidentally burn engineering hours. A managed RDS instance costs more per hour than a raw EC2 instance, but the hidden cost of self-managing is an on-call rotation for backups, failover, patching, and storage scaling. Unless your workload requires specific tuning that managed services do not expose, choose managed.
Cost Awareness
Cloud costs are one of the most common operational surprises for teams moving to the cloud.
Remember: Cost optimization mnemonic: "RRSDT" — Right-size, Reserve, Spot, storage lifecycle (Demote), Tag. These five strategies address the vast majority of cloud waste. Tagging is the foundation — without tags attributing cost to teams, nobody owns the bill and nobody optimizes.
Cost optimization strategies:
1. Right-size instances: most instances are overprovisioned. Check actual usage.
2. Reserved/committed use: 1-3 year commitments save 30-60% on compute.
3. Spot/preemptible instances: 60-90% cheaper for fault-tolerant workloads.
4. Storage lifecycle: move old data to cheaper storage tiers automatically.
5. Delete unused resources: orphaned EBS volumes and unused Elastic IPs still cost money, and so do the volumes attached to stopped instances.
6. Tag everything: without tags, you can't attribute costs to teams or projects.
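The trade-off behind commitments is simple arithmetic: a committed instance is billed for every hour whether or not it runs, so it only pays off above a break-even utilization of `1 - discount`. The sketch below uses a hypothetical $0.10/hour on-demand rate and a 40% discount purely for illustration; `monthly_cost` and `break_even_utilization` are made-up helper names.

```python
HOURS_PER_MONTH = 730  # common billing approximation


def monthly_cost(on_demand_hourly, hours_used, discount=0.0, committed=False):
    """Monthly cost; a commitment is billed for all hours regardless of use."""
    hours_billed = HOURS_PER_MONTH if committed else hours_used
    return round(on_demand_hourly * (1 - discount) * hours_billed, 2)


def break_even_utilization(discount):
    """Fraction of the month an instance must run for a commitment to pay off."""
    return 1 - discount


rate = 0.10      # hypothetical on-demand price per hour
discount = 0.40  # hypothetical commitment discount

always_on = monthly_cost(rate, HOURS_PER_MONTH)
committed = monthly_cost(rate, HOURS_PER_MONTH, discount, committed=True)
half_time = monthly_cost(rate, HOURS_PER_MONTH / 2)

print(always_on, committed, half_time, break_even_utilization(discount))
```

With these numbers the commitment wins for an always-on instance, but an instance that runs only half the month is cheaper on demand, because 50% utilization is below the 60% break-even point.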
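# AWS cost investigation: January 2024 spend grouped by service.
# The End date is exclusive, so use the first day of the following month.
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE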
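Once the command returns JSON, the next step is usually to rank services by spend. The sketch below builds a time period with an exclusive end date (Cost Explorer excludes the `End` day itself) and totals a response of the shape `get-cost-and-usage` returns when grouped by service. The `SAMPLE` response contains made-up amounts for illustration only, and `month_period` and `top_services` are hypothetical helpers.

```python
import calendar
from datetime import date, timedelta


def month_period(year, month):
    """Build a time period covering one full month (End is exclusive)."""
    start = date(year, month, 1)
    days_in_month = calendar.monthrange(year, month)[1]
    end = start + timedelta(days=days_in_month)  # first day of the next month
    return {"Start": start.isoformat(), "End": end.isoformat()}


def top_services(response, metric="BlendedCost", limit=3):
    """Sum cost per service across all periods and return the largest first."""
    totals = {}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            service = group["Keys"][0]
            amount = float(group["Metrics"][metric]["Amount"])
            totals[service] = totals.get(service, 0.0) + amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:limit]


# Made-up figures in the response shape; not real billing data.
SAMPLE = {
    "ResultsByTime": [
        {
            "TimePeriod": month_period(2024, 1),
            "Groups": [
                {"Keys": ["Amazon Simple Storage Service"],
                 "Metrics": {"BlendedCost": {"Amount": "30.25", "Unit": "USD"}}},
                {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
                 "Metrics": {"BlendedCost": {"Amount": "120.50", "Unit": "USD"}}},
            ],
        }
    ]
}

print(month_period(2024, 1))
print(top_services(SAMPLE))
```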
What Experienced People Know
- The cloud console is for reading, not writing. Everything you create should be in Terraform, CloudFormation, or another IaC tool. Console-created resources are untracked and will drift.
- Security groups are the most common cause of "it's not working." Before debugging the application, check the security group rules.
- Cloud networking is not magic. It follows the same principles as physical networking: subnets, routes, firewalls. The abstraction just hides some complexity.
- Cost alerts should be set up on day one. A runaway process, misconfigured autoscaler, or forgotten resource can generate a five-figure bill in days.
- Availability Zones are your first layer of redundancy. Always deploy across at least two AZs.
- Read the shared responsibility model. The cloud provider secures the infrastructure; you secure your configuration, data, and access.
Wiki Navigation
Prerequisites
- Linux Ops (Topic Pack, L0)
Next Steps
- AWS CloudWatch (Topic Pack, L2)
- AWS EC2 (Topic Pack, L1)
- AWS IAM (Topic Pack, L1)
- AWS Networking (Topic Pack, L1)
- AWS Route 53 (Topic Pack, L2)
- AWS S3 Deep Dive (Topic Pack, L1)
- Cloud Deep Dive (Topic Pack, L2)
- Cloud Ops Drills (Drill, L1)