Infrastructure as Code with Terraform - Street Ops

What experienced Terraform operators know that tutorials don't teach.

Incident Runbooks

State File Corruption or Loss

1. If using a remote backend with versioning (S3):
   - Check S3 bucket versioning: list previous versions of the state file
   - Download the most recent valid version and inspect it before restoring
   aws s3api list-object-versions --bucket my-terraform-state --prefix prod/terraform.tfstate
   aws s3api get-object --bucket my-terraform-state --key prod/terraform.tfstate \
     --version-id <version-id> recovered.tfstate
   # Restore it through Terraform; an older serial may need -force
   terraform state push recovered.tfstate

2. If state is truly lost:
   - Don't panic. The infrastructure still exists; you only lost the mapping between it and your config.
   - Use terraform import to rebuild state, one resource at a time:
     terraform import aws_instance.web i-0abc123def456
     terraform import aws_vpc.main vpc-0abc123
   - For large environments, use terraformer (multi-cloud) or aztfexport (Azure) to generate config + state
   - After import, run terraform plan to verify no changes are planned

3. Prevention:
   - Always use a remote backend with versioning
   - Enable state locking (a DynamoDB table or, on Terraform 1.10+, the S3 backend's use_lockfile option; built-in for Terraform Cloud)
   - Never edit state files manually (use terraform state commands)
   - Back up state before risky operations:
     terraform state pull > backup-$(date +%Y%m%d).tfstate
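
The prevention items above could be captured in configuration roughly as follows. This is a minimal sketch, assuming the AWS provider (v4 or later). The region and the lock table name terraform-locks are hypothetical; the bucket name is the one used in the recovery commands above. The backend block belongs in the configuration being protected, while the bucket and table would normally be created by a separate bootstrap configuration.

```hcl
# In the configuration whose state you want to protect:
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"       # assumption: adjust to your region
    encrypt        = true
    dynamodb_table = "terraform-locks" # hypothetical lock table name
  }
}

# In a separate bootstrap configuration:
# Keep every prior version of the state object so it can be restored.
resource "aws_s3_bucket_versioning" "state" {
  bucket = "my-terraform-state"

  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table for the S3 backend; the hash key must be a string named LockID.
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```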

Drift Detection and Reconciliation

1. Detect drift:
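   terraform plan
   # If someone changed infrastructure outside Terraform, the plan shows
   # changes you didn't make, e.g. "# aws_instance.web will be updated in-place"
   # with "~" marking each changed attribute
   terraform plan -refresh-only   # Show only drift, ignoring pending config changes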

2. Understand the drift:
   - Was it an intentional change by another team? (Coordinate, don't overwrite)
   - Was it an auto-scaling event or AWS service behavior?
   - Was it a manual hotfix during an incident?

3. Reconcile:
   Option A: Accept the drift (update your config to match reality)
     - Edit .tf files to match the current state of infrastructure
     - Run terraform plan to verify "No changes"

   Option B: Revert the drift (let Terraform enforce the config)
     - Run terraform apply to push infrastructure back to config state
     - CAUTION: this may cause downtime if the drift was a scaling change

   Option C: Import and refactor
     - terraform state show aws_instance.web  # See the attributes recorded in state
     - Update config to reflect desired state
     - terraform plan to verify

4. Prevention:
   - Run terraform plan in CI on a schedule (drift detection job)
   - Use Terraform Cloud or Spacelift for automatic drift detection
   - Establish team policy: all changes through Terraform, no console clicks
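
When the drift comes from expected service behavior, such as an Auto Scaling group changing its own capacity, one option is to tell Terraform to stop treating that attribute as drift. A minimal sketch, assuming an aws_autoscaling_group managed by this config; the resource name, the sizes, and both variables are hypothetical:

```hcl
variable "private_subnet_ids" {
  type = list(string)
}

variable "launch_template_id" {
  type = string
}

resource "aws_autoscaling_group" "web" {
  name                = "web"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  lifecycle {
    # Scaling policies change this at runtime; don't plan to revert it.
    ignore_changes = [desired_capacity]
  }
}
```

With this in place, terraform plan no longer proposes resetting desired_capacity, so enforcing the config cannot undo a scale-out by accident. Use it sparingly: every ignored attribute is one Terraform no longer enforces.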

Importing Existing Resources

1. When you need to import:
   - Migrating from console-managed to Terraform-managed infrastructure
   - Someone created resources manually that should be in code
   - After a state loss recovery

2. Import workflow:
   # Step 1: Write the resource config (empty or partial is OK to start)
   resource "aws_instance" "legacy_web" {
     # Will be filled in after import
   }

   # Step 2: Import
   terraform import aws_instance.legacy_web i-0abc123def456

   # Step 3: Run plan to see what config is needed
   terraform plan
   # Terraform reports missing required arguments and attributes that differ from the imported state

   # Step 4: Fill in the config to match the imported state
   # Copy attribute values from terraform state show output

   # Step 5: Verify
   terraform plan
   # Should show "No changes" when config matches state

3. Bulk import (Terraform 1.5+):
   import {
     to = aws_instance.legacy_web
     id = "i-0abc123def456"
   }

   # Generate config automatically:
   terraform plan -generate-config-out=generated.tf

4. Gotchas:
   - Not all resources support import (check provider docs)
   - terraform import only updates state; it does not write your config files
   - Some attributes are computed (you can't set them in config)
   - After import, always verify with terraform plan
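
For many resources of the same type, import blocks can be driven by a map. A sketch, assuming Terraform 1.7 or later (where import blocks accept for_each); the second instance ID, the map keys, the instance type, and the variable are hypothetical placeholders. Config generation with -generate-config-out may not cover import blocks that use for_each, so this sketch writes the resource block by hand.

```hcl
variable "legacy_ami_id" {
  type = string
}

locals {
  # Map of resource key => existing EC2 instance ID (IDs are placeholders).
  legacy_instances = {
    web1 = "i-0abc123def456"
    web2 = "i-0def456abc123"
  }
}

# One import per map entry, each targeting the matching for_each instance.
import {
  for_each = local.legacy_instances
  to       = aws_instance.legacy[each.key]
  id       = each.value
}

resource "aws_instance" "legacy" {
  for_each = local.legacy_instances

  # Fill these in to match the real instances before applying.
  ami           = var.legacy_ami_id
  instance_type = "t3.micro"

  tags = { Name = each.key }
}
```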

State Surgery

When you need to manipulate state directly:

# Move a resource (rename or restructure)
terraform state mv aws_instance.old_name aws_instance.new_name
terraform state mv module.old_module.aws_instance.web module.new_module.aws_instance.web

# Remove a resource from state (Terraform forgets it, doesn't destroy it)
terraform state rm aws_instance.temp_server
# Use case: you want to stop managing a resource without destroying it

# List all resources in state
terraform state list

# Show details of a resource
terraform state show aws_instance.web

# Replace a provider (namespace changes, forks); addresses are hostname/namespace/type
terraform state replace-provider hashicorp/aws registry.example.com/<namespace>/aws

CAUTION: Always back up state before surgery:
terraform state pull > backup.tfstate
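
Renames and removals can also be declared in configuration instead of run by hand, which makes them reviewable in a PR. A sketch, assuming Terraform 1.1+ for moved blocks and 1.7+ for removed blocks, and reusing the resource addresses from the commands above:

```hcl
# Equivalent of: terraform state mv aws_instance.old_name aws_instance.new_name
# The resource block must already be renamed to "new_name" in config.
moved {
  from = aws_instance.old_name
  to   = aws_instance.new_name
}

# Equivalent of: terraform state rm aws_instance.temp_server
# Delete the resource block from config, then add:
removed {
  from = aws_instance.temp_server

  lifecycle {
    destroy = false # forget the resource, leave the real instance running
  }
}
```

Both blocks take effect on the next terraform apply and show up in the plan first, so the usual review step still applies.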

Workspace Patterns

1. CLI workspaces (built-in):
   terraform workspace new staging
   terraform workspace new production
   terraform workspace select staging
   terraform workspace list

   # Use in config:
   resource "aws_instance" "web" {
     instance_type = terraform.workspace == "production" ? "t3.large" : "t3.micro"
     tags = { Environment = terraform.workspace }
   }

   Downside: all workspaces share the same config. Differences are only
   in variable values. Fine for simple cases, fragile for complex ones.

2. Directory-per-environment (preferred for production):
   environments/
     staging/
       main.tf        # Calls shared modules with staging variables
       terraform.tfvars
       backend.tf     # Separate state per environment
     production/
       main.tf
       terraform.tfvars
       backend.tf

   Advantage: environments can evolve independently. You can test
   module upgrades in staging without affecting production.
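
A staging environment in this layout might look like the following sketch. The module path, its input names, and the backend region are hypothetical; the bucket name reuses the one from the state recovery runbook.

```hcl
# environments/staging/backend.tf
terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "staging/terraform.tfstate" # one state object per environment
    region = "us-east-1"                 # assumption: adjust to your region
  }
}

# environments/staging/main.tf
module "web" {
  source = "../../modules/web" # hypothetical shared module

  environment   = "staging"
  instance_type = "t3.micro"
}
```

The production directory would call the same module with its own values (for example t3.large) and its own state key, so a module change can be applied to staging alone.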

Gotchas & War Stories

The accidental destroy: Someone ran terraform apply without reading the plan. A destroy was buried in 200 lines of output, and the production database was gone. Prevention: always run terraform plan -out=tfplan, review it, and apply that exact saved plan. In CI, post the plan as a PR comment so reviewers can see it. Use the prevent_destroy lifecycle setting on critical resources:

resource "aws_db_instance" "main" {
  # ...
  lifecycle {
    prevent_destroy = true
  }
}

Provider version upgrades break things: A provider update changed resource behavior or added required attributes, and terraform plan now shows unexpected changes or errors. Prevention: pin provider versions tightly (= 5.31.0, not ~> 5.0). Update providers intentionally, not accidentally, and test upgrades in staging first.
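
An exact pin could look like this sketch, which uses the version from the example above; the required_version constraint is an assumption. Committing the generated .terraform.lock.hcl file alongside it keeps every run on the same provider build.

```hcl
terraform {
  required_version = ">= 1.5.0" # assumption: match your team's CLI version

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "= 5.31.0" # exact pin; change it only in a deliberate upgrade PR
    }
  }
}
```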

The state lock that won't release: Someone's terraform apply crashed or was killed, leaving the DynamoDB lock in place. Nobody can run Terraform until it is cleared. Fix:

terraform force-unlock <lock-id>
# The lock ID is shown in the error message

Only do this after confirming no other apply is actually running.

Circular dependencies: Resource A depends on Resource B, which depends on Resource A, and Terraform can't resolve the cycle. Common scenario: security group A allows traffic from security group B, and vice versa, with the rules written inline. Fix: create the security groups first, then add the rules as separate resources.
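
The fix can be sketched as follows, assuming the AWS provider. The group names, the port, and the vpc_id variable are hypothetical.

```hcl
variable "vpc_id" {
  type = string
}

# Groups are created with no inline rules, so neither depends on the other.
resource "aws_security_group" "a" {
  name   = "group-a"
  vpc_id = var.vpc_id
}

resource "aws_security_group" "b" {
  name   = "group-b"
  vpc_id = var.vpc_id
}

# Rules reference both groups, but nothing references the rules: no cycle.
resource "aws_security_group_rule" "a_from_b" {
  type                     = "ingress"
  from_port                = 443
  to_port                  = 443
  protocol                 = "tcp"
  security_group_id        = aws_security_group.a.id
  source_security_group_id = aws_security_group.b.id
}

resource "aws_security_group_rule" "b_from_a" {
  type                     = "ingress"
  from_port                = 443
  to_port                  = 443
  protocol                 = "tcp"
  security_group_id        = aws_security_group.b.id
  source_security_group_id = aws_security_group.a.id
}
```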

Secrets in state: Terraform state stores resource attributes in plaintext, including database passwords, API keys, and other secrets. Marking a variable as sensitive only hides it from CLI output; the value is still in the state file. Always encrypt state at rest (S3 server-side encryption) and restrict access to the state backend.
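
For an S3 state bucket, encryption at rest and a baseline access restriction might be declared as in this sketch. It assumes the AWS provider (v4 or later) and the bucket name used in the recovery commands earlier; the choice of the AWS-managed KMS key is also an assumption.

```hcl
# Encrypt every object in the state bucket by default.
resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = "my-terraform-state"

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms" # uses the AWS-managed S3 key unless a key ID is set
    }
  }
}

# Make sure the bucket can never be exposed publicly.
resource "aws_s3_bucket_public_access_block" "state" {
  bucket = "my-terraform-state"

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

Encryption protects the stored object only; anyone who can read the bucket or run terraform state pull still sees the secrets, so IAM access to the backend needs the same care.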

The count/index trap: When count iterates over a list, removing an item from the middle shifts the indices, so every subsequent resource is changed or recreated. Switch to for_each with a map:

# BAD
variable "servers" { default = ["web1", "web2", "web3"] }
resource "aws_instance" "server" {
  count = length(var.servers)
  tags  = { Name = var.servers[count.index] }
}

# GOOD
variable "servers" { default = { web1 = {}, web2 = {}, web3 = {} } }
resource "aws_instance" "server" {
  for_each = var.servers
  tags     = { Name = each.key }
}
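
Switching an existing resource from count to for_each changes every instance address, which Terraform would otherwise plan as destroy and create. A sketch of moved blocks (Terraform 1.1+) that map the old indices to the new keys, assuming the list order shown in the BAD example:

```hcl
# Old count index => new for_each key, following the original list order.
moved {
  from = aws_instance.server[0]
  to   = aws_instance.server["web1"]
}

moved {
  from = aws_instance.server[1]
  to   = aws_instance.server["web2"]
}

moved {
  from = aws_instance.server[2]
  to   = aws_instance.server["web3"]
}
```

After adding these alongside the for_each version of the resource, terraform plan should report the moves rather than replacements.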

Essential Commands Cheatsheet

# Initialization
terraform init                          # Download providers, configure backend
terraform init -upgrade                 # Upgrade providers to latest allowed

# Planning and applying
terraform plan                          # Dry run
terraform plan -out=tfplan              # Save plan for exact apply
terraform apply tfplan                  # Apply saved plan
terraform apply -auto-approve           # Skip confirmation (CI only!)

# State operations
terraform state list                    # All managed resources
terraform state show <resource>         # Resource details
terraform state mv <old> <new>          # Rename/move resource
terraform state rm <resource>           # Stop managing (don't destroy)
terraform state pull                    # Download state to stdout

# Import
terraform import <resource> <id>        # Import existing infrastructure

# Formatting and validation
terraform fmt -recursive                # Format all .tf files
terraform validate                      # Check syntax and internal consistency

# Output
terraform output                        # Show all outputs
terraform output -json                  # JSON format for scripts

Quick Reference

Cheatsheet: Terraform
Deep Dive: Terraform State Internals