Portal | Level: L1: Foundations | Domain: DevOps & Tooling

Cloud Operations Basics - Primer

Why This Matters

Cloud is where most infrastructure runs today. Whether it's AWS, GCP, or Azure, the concepts are the same: IAM controls access, networking connects services, compute runs your workloads, and storage persists your data. As a DevOps engineer, you need to understand these primitives, troubleshoot connectivity issues, manage costs, and design secure architectures. Cloud-agnostic understanding transfers across providers.

Core Concepts

IAM (Identity and Access Management)

IAM is the security control plane for everything in the cloud. Every API call is authenticated and authorized through IAM.

Fun fact: AWS IAM launched in 2010, four years after S3 and EC2 (both 2006). Before IAM, every AWS account had a single root credential, which teams typically shared. The introduction of IAM roles for EC2 in 2012 was a turning point: for the first time, EC2 instances could obtain permissions without long-lived access keys baked into the instance.

Key entities:

Entity | What it is | Use case
User | Human identity with credentials | Admin console access
Group | Collection of users | Developers, SREs, Read-only
Role | Set of permissions, assumable | EC2 instances, Lambda, CI/CD
Policy | JSON document defining permissions | Attached to users, groups, or roles
Service Account | Non-human identity (GCP term; Azure uses service principals and managed identities) | Applications, automation

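Policy structure (AWS example). Note that aws:SourceIp matches the caller's public IP address, so the CIDR here is a public range (203.0.113.0/24 is a documentation placeholder); for traffic arriving through a VPC endpoint, use aws:VpcSourceIp with the private range instead:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowS3Read",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket",
        "arn:aws:s3:::my-bucket/*"
      ],
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": "203.0.113.0/24"
        }
      }
    }
  ]
}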

Core principles:

  • Use roles instead of long-lived access keys wherever possible
  • Assign permissions to groups, not individual users
  • Start with zero permissions, add as needed
  • Use conditions to restrict when policies apply (IP, time, MFA)
  • Enable MFA on all human accounts, especially admins
  • Review and prune unused permissions regularly
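The "start with zero permissions" principle and the condition block in the policy can be illustrated with a toy evaluator. This is a minimal sketch, not AWS's actual evaluation logic: `POLICY`, `is_allowed`, and the 203.0.113.0/24 range are illustrative assumptions, and real IAM also weighs resource policies, permission boundaries, and organization-level controls.

```python
import fnmatch
import ipaddress

# Hypothetical policy mirroring the structure of the AWS example.
POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
            "Condition": {"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}},
        }
    ],
}


def _matches(value, patterns):
    """True if value matches any pattern; '*' acts as a wildcard."""
    return any(fnmatch.fnmatchcase(value, p) for p in patterns)


def is_allowed(policy, action, resource, source_ip):
    """Simplified decision: implicit deny, and an explicit Deny beats any Allow."""
    allowed = False
    for stmt in policy["Statement"]:
        if not _matches(action, stmt["Action"]):
            continue
        if not _matches(resource, stmt["Resource"]):
            continue
        cidr = stmt.get("Condition", {}).get("IpAddress", {}).get("aws:SourceIp")
        if cidr and ipaddress.ip_address(source_ip) not in ipaddress.ip_network(cidr):
            continue
        if stmt["Effect"] == "Deny":
            return False
        allowed = True
    return allowed  # stays False unless some Allow statement matched


obj = "arn:aws:s3:::my-bucket/report.csv"
print(is_allowed(POLICY, "s3:GetObject", obj, "203.0.113.10"))  # True
print(is_allowed(POLICY, "s3:PutObject", obj, "203.0.113.10"))  # False: never granted
print(is_allowed(POLICY, "s3:GetObject", obj, "198.51.100.1"))  # False: condition fails
```

Anything not explicitly granted falls through to the final `return`, which is the default-deny behavior the principles rely on.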

VPC and Networking

Name origin: The term "VPC" was introduced by AWS in 2009. Before VPC, all EC2 instances launched into a single shared network called "EC2-Classic." VPC gave each customer an isolated virtual network with their own IP space, subnets, and routing tables. EC2-Classic was fully retired in August 2023. GCP uses the term "VPC Network" and Azure uses "Virtual Network (VNet)": different names, same concept.

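A VPC (Virtual Private Cloud) is your isolated network in the cloud. Network-attached resources such as VMs, load balancers, and managed databases live inside a VPC; services such as IAM and object storage sit outside it and are reached through their endpoints.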

VPC: 10.0.0.0/16
├── Public Subnet: 10.0.1.0/24  (has route to Internet Gateway)
│   ├── Load Balancer
│   └── NAT Gateway
├── Private Subnet: 10.0.10.0/24 (routes to NAT for outbound only)
│   ├── Application servers
│   └── Worker nodes
└── Private Subnet: 10.0.20.0/24 (no internet route)
    └── Database servers

Key components:

  • Subnets: segments of the VPC CIDR, each in one Availability Zone
  • Route tables: control where traffic flows
  • Internet Gateway (IGW): connects VPC to the internet
  • NAT Gateway: allows private subnets to reach the internet (outbound only)
  • Security Groups: stateful firewalls on instances (allow rules only)
  • Network ACLs (NACLs): stateless firewalls on subnets (allow + deny rules)
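A subnet plan like the one in the diagram can be sanity-checked before anything is provisioned. The sketch below uses Python's standard `ipaddress` module; the subnet names and the `validate` and `usable_hosts` helpers are hypothetical, and the five reserved addresses per subnet reflect AWS behavior (other providers reserve different counts).

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves the first four addresses and the last one

vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = {
    "public": ipaddress.ip_network("10.0.1.0/24"),
    "private-app": ipaddress.ip_network("10.0.10.0/24"),
    "private-db": ipaddress.ip_network("10.0.20.0/24"),
}


def validate(vpc, subnets):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    names = list(subnets)
    for name in names:
        if not subnets[name].subnet_of(vpc):
            problems.append(f"{name} is outside the VPC CIDR")
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if subnets[a].overlaps(subnets[b]):
                problems.append(f"{a} overlaps {b}")
    return problems


def usable_hosts(subnet):
    """Addresses left for your own resources in an AWS subnet."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


print(validate(vpc, subnets))           # []
print(usable_hosts(subnets["public"]))  # 251
```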

Security Groups vs NACLs:

Feature | Security Group | NACL
Level | Instance (ENI) | Subnet
Stateful | Yes (return traffic auto-allowed) | No (must allow both directions)
Rules | Allow only | Allow and Deny
Evaluation | All rules evaluated | Rules evaluated in ascending number order; first match wins
Default | Deny all inbound, allow all outbound | Allow all (the default NACL; a custom NACL denies all until rules are added)
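The difference in rule evaluation is easier to see in code. The following is a deliberately simplified sketch that matches on destination port only (real rules also match protocol and CIDR, and statefulness is not modeled); `Rule`, `nacl_decision`, and `security_group_decision` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    number: int     # ordering key; only meaningful for NACLs
    action: str     # "allow" or "deny" (security groups only have "allow")
    port_from: int
    port_to: int

    def matches(self, port: int) -> bool:
        return self.port_from <= port <= self.port_to


def nacl_decision(rules, port):
    """First matching rule in ascending number order wins; otherwise deny."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.matches(port):
            return rule.action
    return "deny"


def security_group_decision(rules, port):
    """Any matching allow rule permits the traffic; otherwise deny."""
    return "allow" if any(r.matches(port) for r in rules) else "deny"


nacl = [Rule(200, "allow", 0, 65535), Rule(100, "deny", 22, 22)]
sg = [Rule(0, "allow", 443, 443)]

print(nacl_decision(nacl, 22))           # deny: rule 100 is hit before rule 200
print(nacl_decision(nacl, 443))          # allow
print(security_group_decision(sg, 443))  # allow
print(security_group_decision(sg, 22))   # deny: no allow rule matches
```

Rule numbering matters only for the NACL: swapping the numbers of the two NACL rules would let SSH through, while reordering security group rules changes nothing.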

Compute Primitives

Service | What | When to use
VMs (EC2, GCE, Azure VM) | Full virtual machines | Long-running services, stateful workloads
Containers (ECS, Cloud Run, Azure Container Apps) | Managed container orchestration | Microservices, stateless workloads
Serverless (Lambda, Cloud Functions, Azure Functions) | Event-driven code execution | API handlers, event processing, cron jobs
Kubernetes (EKS, GKE, AKS) | Managed K8s control plane | Complex container orchestration

Instance types (AWS naming convention):

m5.xlarge
│ │ │
│ │ └── Size (nano, micro, small, medium, large, xlarge, 2xlarge...)
│ └──── Generation (higher = newer, usually better price/performance)
└────── Family (m=general, c=compute, r=memory, t=burstable, g=GPU)
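The naming convention can be decoded mechanically, which is handy when auditing a fleet inventory. This is a rough sketch: `parse_instance_type` and the regular expression are assumptions that cover common names such as m5.xlarge or m6i.2xlarge, not every instance type AWS offers.

```python
import re

# Family letters taken from the diagram above; anything else is reported as unknown.
FAMILIES = {
    "m": "general purpose",
    "c": "compute optimized",
    "r": "memory optimized",
    "t": "burstable",
    "g": "GPU",
}

PATTERN = re.compile(
    r"^(?P<family>[a-z]+)(?P<generation>\d+)(?P<attributes>[a-z-]*)\.(?P<size>[a-z0-9-]+)$"
)


def parse_instance_type(name):
    """Split an instance type name into family, generation, attributes, and size."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized instance type: {name!r}")
    parts = match.groupdict()
    parts["generation"] = int(parts["generation"])
    parts["family_description"] = FAMILIES.get(parts["family"], "unknown")
    return parts


print(parse_instance_type("m5.xlarge"))
print(parse_instance_type("m6i.2xlarge"))
```

The `attributes` group captures suffix letters such as the "i" in m6i, which the diagram does not break out separately.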

Remember: AWS instance family mnemonic: "Most workloads (general), Compute-heavy, RAM-heavy, Tiny/burstable, GPU." Each generation bump (m5 to m6i to m7i) typically brings 10-20% better price/performance, so prefer the latest generation that is available in your region and supported by your software.

Gotcha: T-series (burstable) instances have CPU credits. When credits run out in standard mode, performance drops to the baseline (e.g., 10% of a vCPU for t3.micro). This causes mysterious periodic slowdowns that look like application bugs. T3 instances default to unlimited mode, which avoids the throttling but bills sustained bursting as surplus credits. Use CloudWatch → CPUCreditBalance to monitor. For production workloads that need consistent performance, use M-series.
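The credit mechanics reduce to simple arithmetic: one CPU credit is one vCPU running at 100% for one minute. The sketch below simulates standard mode hour by hour; `simulate_credits` is a hypothetical helper, and the defaults assume the published t3.micro figures (2 vCPUs, 12 credits earned per hour, 288 credits maximum balance), so check current AWS documentation before relying on them.

```python
def baseline_utilization(earn_per_hour=12.0, vcpus=2):
    """Per-vCPU utilization that exactly consumes the credits earned each hour."""
    return earn_per_hour / (vcpus * 60)


def simulate_credits(utilization_by_hour, vcpus=2, earn_per_hour=12.0,
                     max_balance=288.0, start_balance=0.0):
    """Return (balance, throttled) per hour for a standard-mode burstable instance."""
    balance = start_balance
    history = []
    for util in utilization_by_hour:
        wanted = util * vcpus * 60          # credits the workload would like to spend
        available = balance + earn_per_hour
        spent = min(wanted, available)
        balance = min(max_balance, available - spent)
        history.append((round(balance, 1), wanted > available))
    return history


print(baseline_utilization())  # 0.1, i.e. 10% per vCPU
# One quiet hour at 5% CPU, then two busy hours at 50% CPU, starting with 24 credits.
for hour, (balance, throttled) in enumerate(
        simulate_credits([0.05, 0.5, 0.5], start_balance=24.0), start=1):
    print(hour, balance, "throttled" if throttled else "ok")
```

In this run the balance is exhausted during the second hour, which is the point where an application would start to look mysteriously slow.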

Right-sizing: don't guess instance sizes. Start small, monitor actual CPU and memory usage, and adjust. For variable workloads, burstable instances (T-series) are an alternative.
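A right-sizing decision can start from something as simple as a high percentile of observed CPU utilization. This is a minimal sketch: `percentile` and `recommend` are hypothetical helpers, and the 40% and 80% thresholds are illustrative assumptions rather than recommended values. Memory deserves the same treatment.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend(cpu_samples, low=40.0, high=80.0):
    """Suggest a sizing action from CPU utilization samples (percent)."""
    p95 = percentile(cpu_samples, 95)
    if p95 < low:
        return "downsize"
    if p95 > high:
        return "upsize"
    return "keep"


# Hypothetical hourly CPU averages for one instance.
samples = [10, 12, 15, 20, 18, 11, 14, 13, 16, 35]
print(percentile(samples, 95), recommend(samples))  # 35 downsize
```

Using the 95th percentile rather than the mean keeps short peaks from being averaged away, which is the usual reason an overprovisioned instance looks justified at first glance.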

Storage Primitives

Service | Type | Use case
Block storage (EBS, Persistent Disk) | Disk volumes attached to VMs | Databases, OS disks
Object storage (S3, GCS, Blob) | Key-value for files/objects | Backups, static assets, data lakes
File storage (EFS, Filestore) | Shared filesystem (NFS) | Shared configs, CMS content

S3 storage classes (cost optimization):

  • Standard: frequently accessed data
  • Infrequent Access (IA): less accessed, cheaper storage, retrieval fee
  • Glacier: archive, minutes to hours for retrieval
  • Lifecycle rules automate transitions between classes

Remember: S3 storage class order by storage cost (high to low): Standard → IA → One Zone-IA → Glacier Instant → Glacier Flexible → Glacier Deep Archive. Mnemonic: "SIOGgd" ("Store It Once, Glacier gets deeper"). Deep Archive is the cheapest at roughly $1/TB/month, but standard retrieval takes up to 12 hours and bulk retrieval up to 48 hours.
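Lifecycle rules are declared as a small configuration document. The sketch below follows the shape of the S3 lifecycle configuration; the rule ID, prefix, and day counts are hypothetical, and `storage_class_at` is an illustrative helper that previews where an object of a given age would sit, not part of any AWS SDK.

```python
# Hypothetical lifecycle configuration for objects under the "logs/" prefix.
LIFECYCLE = {
    "Rules": [
        {
            "ID": "archive-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }
    ]
}


def storage_class_at(rule, age_days):
    """Storage class an object would be in after age_days under this rule."""
    current = "STANDARD"
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


rule = LIFECYCLE["Rules"][0]
for age in (10, 45, 120, 400):
    print(age, storage_class_at(rule, age))
```

A dictionary in this shape is what would be handed to an SDK or IaC tool; previewing it locally first is a cheap way to catch a transition that fires earlier than intended.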

Database Services

Type | AWS | GCP | Azure | When to use
Relational | RDS, Aurora | Cloud SQL | Azure SQL Database | Structured data, transactions
Key-value | DynamoDB | Firestore | Cosmos DB | High throughput, simple queries
Cache | ElastiCache | Memorystore | Azure Cache for Redis | Session data, hot data
Document | DocumentDB | Firestore | Cosmos DB | Flexible schemas

Managed databases handle backups, patching, replication, and failover. Use them unless you have a strong reason to self-manage.

War story: Self-managing databases in the cloud is the most common way teams accidentally burn engineering hours. A managed RDS instance costs more per hour than a raw EC2 instance, but the hidden cost of self-managing is the on-call burden for backups, failover, patching, and storage scaling. Unless your workload requires specific tuning that managed services do not expose, choose managed.

Cost Awareness

Cloud costs are the number one operational surprise for teams moving to the cloud.

Remember: Cost optimization mnemonic: "RRSDT" — Right-size, Reserve, Spot, storage lifecycle (Demote), Tag. These five strategies address the vast majority of cloud waste. Tagging is the foundation — without tags attributing cost to teams, nobody owns the bill and nobody optimizes.

Cost optimization strategies:

  1. Right-size instances: most instances are overprovisioned. Check actual usage.
  2. Reserved/committed use: 1-3 year commitments save roughly 30-60% on compute.
  3. Spot/preemptible instances: 60-90% cheaper for fault-tolerant workloads.
  4. Storage lifecycle: move old data to cheaper storage tiers automatically.
  5. Delete unused resources: orphaned EBS volumes, unused Elastic IPs, and the disks attached to stopped instances still cost money.
  6. Tag everything: without tags, you can't attribute costs to teams or projects.
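The discount ranges above translate into monthly figures with a few lines of arithmetic. This is a back-of-the-envelope sketch: the $0.20 hourly rate and the fleet of ten instances are made-up inputs, `monthly_cost` is a hypothetical helper, and the discounts are simply the ends of the ranges quoted in the list.

```python
HOURS_PER_MONTH = 730  # common approximation: 8760 hours / 12 months


def monthly_cost(hourly_rate, discount=0.0, instances=1):
    """Monthly cost in dollars after applying a fractional discount."""
    return round(hourly_rate * (1 - discount) * HOURS_PER_MONTH * instances, 2)


on_demand_rate = 0.20  # hypothetical hourly price, not a real quote
scenarios = {
    "on-demand": 0.0,
    "reserved, low end (30%)": 0.30,
    "reserved, high end (60%)": 0.60,
    "spot, high end (90%)": 0.90,
}

for name, discount in scenarios.items():
    print(f"{name:26s} ${monthly_cost(on_demand_rate, discount, instances=10):,.2f}")
```

Spot pricing only applies to workloads that tolerate interruption, so the last line is a ceiling on savings rather than a target for the whole fleet.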

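# AWS cost investigation: spend per service for January 2024
# (the End date is exclusive, so use the first day of the following month)
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE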
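The command returns JSON that is tedious to read by eye once an account uses many services. The sketch below ranks services by cost; `SAMPLE` is fabricated data in the shape this command returns when grouped by service, and `top_services` is a hypothetical helper. In practice the input would come from saving the CLI output to a file and loading it with `json.load`.

```python
# Fabricated sample response; the amounts are made up for illustration.
SAMPLE = {
    "ResultsByTime": [
        {
            "TimePeriod": {"Start": "2024-01-01", "End": "2024-02-01"},
            "Groups": [
                {"Keys": ["Amazon Simple Storage Service"],
                 "Metrics": {"BlendedCost": {"Amount": "212.40", "Unit": "USD"}}},
                {"Keys": ["Amazon Elastic Compute Cloud - Compute"],
                 "Metrics": {"BlendedCost": {"Amount": "1830.15", "Unit": "USD"}}},
                {"Keys": ["Amazon Relational Database Service"],
                 "Metrics": {"BlendedCost": {"Amount": "640.00", "Unit": "USD"}}},
            ],
        }
    ]
}


def top_services(response, metric="BlendedCost", limit=3):
    """Sum cost per service across all periods and return the largest first."""
    totals = {}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            service = group["Keys"][0]
            amount = float(group["Metrics"][metric]["Amount"])
            totals[service] = totals.get(service, 0.0) + amount
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


for service, amount in top_services(SAMPLE):
    print(f"{amount:10.2f}  {service}")
```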

What Experienced People Know

  • The cloud console is for reading, not writing. Everything you create should be in Terraform, CloudFormation, or another IaC tool. Console-created resources are untracked and will drift.
  • Security groups are the most common cause of "it's not working." Before debugging the application, check the security group rules.
  • Cloud networking is not magic. It follows the same principles as physical networking: subnets, routes, firewalls. The abstraction just hides some complexity.
  • Cost alerts should be set up on day one. A runaway process, misconfigured autoscaler, or forgotten resource can generate a five-figure bill in days.
  • Availability Zones are your first layer of redundancy. Always deploy across at least two AZs.
  • Read the shared responsibility model. The cloud provider secures the infrastructure; you secure your configuration, data, and access.
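The day-one cost alert mentioned above boils down to comparing projected spend against a budget. Managed budget and alerting services do this for you; the sketch below only illustrates the logic, with `projected_month_end`, `should_alert`, and the 80% threshold as assumptions.

```python
import calendar
import datetime


def projected_month_end(spend_to_date, today):
    """Linear projection of month-end spend from month-to-date spend."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def should_alert(spend_to_date, today, budget, threshold=0.8):
    """True when projected spend reaches the given fraction of the budget."""
    return projected_month_end(spend_to_date, today) >= budget * threshold


today = datetime.date(2024, 1, 10)
print(projected_month_end(500.0, today))         # 1550.0
print(should_alert(500.0, today, budget=1500.0))  # True
print(should_alert(100.0, today, budget=1500.0))  # False
```

A linear projection is crude, but it catches the runaway case early: a bill that is already at a third of the budget ten days into the month triggers well before the invoice arrives.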
