Terraform Deep Dive - Street Ops

What experienced Terraform operators deal with daily. Debugging, migrations, CI/CD patterns, and performance.

Debugging State Drift

State drift means the real infrastructure doesn't match what Terraform thinks exists. Someone clicked in the console, an auto-scaler changed instance counts, or a Lambda updated a resource attribute.

Diagnosing Drift

# The first tool: run plan and read every line
terraform plan

# If the plan shows unexpected changes, dig into the specific resource
terraform state show aws_instance.web

# Compare state to reality
aws ec2 describe-instances --instance-ids i-0abc123def456789a \
  --query 'Reservations[].Instances[].{Type:InstanceType,State:State.Name,Tags:Tags}'

Common drift patterns:

  • Tags changed: Someone added tags in the console. If you don't manage those tags in Terraform, use ignore_changes = [tags] for the specific keys.
  • Security group rules added: A manual emergency rule was added during an incident. Add it to the Terraform config or remove it manually.
  • Instance type changed: Someone resized through the console. Plan will show a change. Decide: update the config to match, or let Terraform revert it.
  • Resource deleted outside Terraform: Plan shows the resource will be created. If the deletion was intentional, remove it from the config and run terraform state rm to clean up any stale state entry. If accidental, apply to recreate.
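The tags case, sketched as a lifecycle block (the specific keys are illustrative):

```hcl
resource "aws_instance" "web" {
  # ... instance config ...

  lifecycle {
    # Don't fight another team over tags they manage by hand
    ignore_changes = [tags["CostCenter"], tags["Owner"]]
  }
}
```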

Scheduled Drift Detection

#!/usr/bin/env bash
# drift-check.sh -- run in CI on a schedule (e.g., daily)
set -euo pipefail

terraform init -input=false
PLAN_OUTPUT=$(terraform plan -detailed-exitcode -input=false 2>&1) || EXIT_CODE=$?

case ${EXIT_CODE:-0} in
  0) echo "No drift detected." ;;
  1) echo "ERROR: Plan failed."; echo "$PLAN_OUTPUT"; exit 1 ;;
  2) echo "DRIFT DETECTED:"; echo "$PLAN_OUTPUT"
     # Send alert to Slack/PagerDuty (Slack webhooks expect a JSON body)
     curl -X POST -H 'Content-type: application/json' "$SLACK_WEBHOOK" \
       -d "{\"text\": \"Terraform drift detected in $(basename "$PWD")\"}"
     ;;
esac

The -detailed-exitcode flag returns 0 (no changes), 1 (error), or 2 (changes detected). This is the foundation of automated drift detection.

State File Corruption Recovery

Symptoms

Error: Failed to load state: ...
Error: Unsupported state file format: ...
Error: state snapshot was created by Terraform v1.x, which is newer than current v1.y

Recovery Steps

# Step 1: Pull the state and inspect it
terraform state pull > current_state.json
python3 -m json.tool current_state.json  # Is it valid JSON?

# Step 2: If S3 backend, check versioning for a good copy
aws s3api list-object-versions \
  --bucket mycompany-terraform-state \
  --prefix prod/network/terraform.tfstate \
  --query 'Versions[*].{Modified:LastModified,VersionId:VersionId,Size:Size}' \
  --output table

# Step 3: Restore a previous version
aws s3api get-object \
  --bucket mycompany-terraform-state \
  --key prod/network/terraform.tfstate \
  --version-id "abc123previousversion" \
  recovered.tfstate

# Step 4: Verify the recovered state is valid
python3 -m json.tool recovered.tfstate > /dev/null && echo "Valid JSON"

# Step 5: Push the recovered state (DANGEROUS -- double-check)
terraform state push recovered.tfstate

# Step 6: Plan to verify recovered state matches reality
terraform plan

If versioning wasn't enabled (lesson learned: enable it), you have to rebuild state by importing every resource. Use terraform import for each one, or a bulk-import tool like terraformer (multi-cloud) or aztfexport (Azure).
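A full rebuild is tedious by hand, so a small wrapper can drive the imports from a list. This is a hypothetical sketch, not a standard tool: the two-column file format and the DRY_RUN flag are assumptions.

```shell
# bulk_import: hypothetical helper that reads "RESOURCE_ADDRESS RESOURCE_ID"
# pairs from a file and runs `terraform import` for each one.
# Set DRY_RUN=1 to print the commands instead of executing them.
bulk_import() {
  local list_file=$1
  local address id
  while read -r address id; do
    # Skip blank lines and comments in the resource list
    [[ -z "$address" || "$address" == \#* ]] && continue
    if [[ "${DRY_RUN:-0}" == "1" ]]; then
      echo "terraform import '$address' '$id'"
    else
      terraform import "$address" "$id"
    fi
  done < "$list_file"
}
```

Feed it a file with lines like `aws_vpc.main vpc-0abc123`, and run with DRY_RUN=1 first to sanity-check the generated commands.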

Migrating State Backends

Moving from local state to S3, or from one S3 bucket to another:

# Step 1: Backup current state
terraform state pull > backup-$(date +%Y%m%d-%H%M%S).tfstate

# Step 2: Add the new backend configuration to your .tf files
# (Change the backend block in your terraform {} block)

# Step 3: Re-initialize with migration
terraform init -migrate-state

# Terraform asks:
# "Do you want to copy existing state to the new backend?"
# Type "yes"

# Step 4: Verify
terraform plan  # Should show "No changes"

# Step 5: If migrating FROM local, delete the local state file
rm terraform.tfstate terraform.tfstate.backup

If migrating between remote backends (e.g., S3 to Terraform Cloud), you may need to pull state, change the backend, then push:

terraform state pull > migration.tfstate
# Edit backend config
terraform init -reconfigure
terraform state push migration.tfstate
terraform plan  # Verify

Moving Resources Between Modules

When you refactor a monolithic config into modules:

Method 1: terraform state mv (imperative)

# Move a resource from root to a module
terraform state mv 'aws_instance.web' 'module.compute.aws_instance.web'

# Move a resource between modules
terraform state mv 'module.old.aws_instance.web' 'module.new.aws_instance.web'

# Move an indexed resource
terraform state mv 'aws_subnet.private[0]' 'module.network.aws_subnet.private["us-east-1a"]'

Method 2: moved blocks (declarative, preferred)

# In your .tf files
moved {
  from = aws_instance.web
  to   = module.compute.aws_instance.web
}

moved {
  from = aws_subnet.private
  to   = module.network.aws_subnet.private
}

Moved blocks are better because they're code-reviewed, version-controlled, and applied by every operator. state mv is a one-time manual operation that others have to coordinate around.

Importing Existing Infrastructure

Single Resource Import

# Traditional imperative import
terraform import aws_instance.web i-0abc123def456789a

# For resources in modules
terraform import module.compute.aws_instance.web i-0abc123def456789a

# For indexed resources
terraform import 'aws_instance.web["app1"]' i-0abc123def456789a

Bulk Import Workflow (Terraform 1.5+)

# 1. Write import blocks for each resource
import {
  to = aws_vpc.main
  id = "vpc-0abc123"
}

import {
  to = aws_subnet.private["us-east-1a"]
  id = "subnet-0def456"
}

import {
  to = aws_security_group.web
  id = "sg-0ghi789"
}

# 2. Generate config from the imports
terraform plan -generate-config-out=generated_resources.tf

# 3. Review generated code, clean it up, split into proper files

# 4. Run plan -- iterate until "No changes"
terraform plan

# 5. Apply to finalize the imports
terraform apply

# 6. Remove the import blocks (they're one-time declarations)

Handling Import Mismatches

After import, terraform plan often shows changes because your config doesn't match reality exactly. Common issues:

  • Default values: AWS sets defaults you didn't specify. Add them to your config.
  • Computed attributes: Some attributes are read-only. Don't try to set them.
  • Ordering: List attributes may be in a different order. Use for_each with sorted keys.

# Show what Terraform thinks the resource looks like after import
terraform state show aws_instance.web

# Compare with your config and adjust until plan shows no changes
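The ordering fix from the list above, sketched as a for_each keyed by a stable value (the variable name is illustrative):

```hcl
variable "private_subnet_azs" {
  type    = list(string)
  default = ["us-east-1a", "us-east-1b"]
}

resource "aws_subnet" "private" {
  # Keyed by AZ instead of list position, so reordering the list
  # doesn't plan a destroy/recreate
  for_each          = toset(var.private_subnet_azs)
  availability_zone = each.key
  # ... vpc_id, per-AZ cidr_block ...
}
```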

Handling Circular Dependencies

Terraform builds a dependency graph and fails if it finds cycles.

Error: Cycle: aws_security_group.a, aws_security_group.b

Breaking Cycles

# WRONG: Circular reference
resource "aws_security_group" "a" {
  ingress {
    security_groups = [aws_security_group.b.id]
  }
}

resource "aws_security_group" "b" {
  ingress {
    security_groups = [aws_security_group.a.id]
  }
}

# FIX: Use separate security group rules
resource "aws_security_group" "a" {
  name = "sg-a"
}

resource "aws_security_group" "b" {
  name = "sg-b"
}

resource "aws_security_group_rule" "a_from_b" {
  type                     = "ingress"
  security_group_id        = aws_security_group.a.id
  source_security_group_id = aws_security_group.b.id
  from_port                = 443
  to_port                  = 443
  protocol                 = "tcp"
}

resource "aws_security_group_rule" "b_from_a" {
  type                     = "ingress"
  security_group_id        = aws_security_group.b.id
  source_security_group_id = aws_security_group.a.id
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
}

Debugging Provider Errors

# Enable full debug logging
export TF_LOG=DEBUG
terraform plan 2>debug.log

# Log only provider communication
export TF_LOG=TRACE
export TF_LOG_PROVIDER=DEBUG

# Log to a file
export TF_LOG_PATH=terraform.log
terraform plan

# Disable logging
unset TF_LOG TF_LOG_PATH

Common provider errors and fixes:

  • "Error: creating X: AccessDenied": IAM permissions. Check the AWS role/user has the right policy. Use aws sts get-caller-identity to verify who you are.
  • "Error: Provider produced inconsistent result": Provider bug or API quirk. Pin the provider version and check GitHub issues.
  • "Error: error configuring Terraform AWS Provider: no valid credential sources": AWS credential chain issue. Check AWS_PROFILE, ~/.aws/credentials, instance profile, or environment variables.
  • "Error: timeout while waiting for state": Resource taking longer than expected. Increase timeouts block in the resource.

resource "aws_db_instance" "main" {
  # ...
  timeouts {
    create = "60m"
    update = "60m"
    delete = "30m"
  }
}

Managing Multiple Environments

Option 1: Directory per environment (most isolated)

environments/
  dev/
    main.tf          # Calls shared modules
    variables.tf
    terraform.tfvars  # Dev-specific values
    backend.tf        # Dev state backend
  staging/
    main.tf
    variables.tf
    terraform.tfvars
    backend.tf
  prod/
    main.tf
    variables.tf
    terraform.tfvars
    backend.tf
modules/
  vpc/
  compute/
  database/

Each environment has its own state file, backend, and variables. Complete isolation. You can apply dev without touching prod.

Option 2: Workspaces (simpler but less isolated)

terraform workspace new dev
terraform workspace new staging
terraform workspace new prod

# Use workspace name in config
locals {
  env_config = {
    dev     = { instance_type = "t3.micro",  count = 1 }
    staging = { instance_type = "t3.small",  count = 2 }
    prod    = { instance_type = "m5.large",  count = 3 }
  }
  config = local.env_config[terraform.workspace]
}

Workspaces share a backend. One set of IAM permissions. Less isolation. Use directory structure for production workloads.

Option 3: Terragrunt (DRY across environments)

# terragrunt.hcl in each environment
terraform {
  source = "../../../modules//vpc"
}

include "root" {
  path = find_in_parent_folders()
}

inputs = {
  vpc_cidr    = "10.0.0.0/16"
  environment = "prod"
}

Terragrunt reduces duplication but adds another tool to learn and maintain.

Terraform in CI/CD

The Standard Pipeline

# .github/workflows/terraform.yml
name: Terraform
on:
  pull_request:
    paths: ['infra/**']
  push:
    branches: [main]
    paths: ['infra/**']

jobs:
  plan:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0

      - name: Init
        run: terraform init -input=false
        working-directory: infra/

      - name: Validate
        run: terraform validate
        working-directory: infra/

      - name: Plan
        id: plan
        run: terraform plan -input=false -no-color -out=plan.tfplan
        working-directory: infra/
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      - name: Comment plan on PR
        uses: actions/github-script@v7
        with:
          script: |
            const output = `${{ steps.plan.outputs.stdout }}`;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '```\n' + output + '\n```'
            });

  apply:
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0

      - name: Init
        run: terraform init -input=false
        working-directory: infra/

      - name: Apply
        run: terraform apply -input=false -auto-approve
        working-directory: infra/
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

Key principle: Plan on PR (so reviewers see what will change), apply on merge to main (so approved changes execute). Never auto-approve on PRs.
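One gap in the pipeline above: the apply job re-plans from scratch, so what executes may differ from the plan reviewers approved. If plan and apply run in the same workflow run, you can apply exactly the reviewed plan by passing the plan file through an artifact (a sketch; job wiring with needs: is assumed):

```yaml
      # In the plan job, after `terraform plan -out=plan.tfplan`
      - uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: infra/plan.tfplan

      # In the apply job (needs: plan), before applying
      - uses: actions/download-artifact@v4
        with:
          name: tfplan
          path: infra/

      - name: Apply saved plan
        run: terraform apply -input=false plan.tfplan
        working-directory: infra/
```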

OIDC Authentication (no long-lived keys)

permissions:
  id-token: write
  contents: read

steps:
  - name: Configure AWS credentials
    uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/terraform-ci
      aws-region: us-east-1

OIDC eliminates long-lived AWS keys in CI. The CI runner gets temporary credentials for each run.
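On the AWS side, the terraform-ci role needs a trust policy pointing at GitHub's OIDC provider, with the subject pinned to your repo (account ID and repo path below are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:mycompany/infra-repo:*"
        }
      }
    }
  ]
}
```

Tighten the sub condition further (e.g., to a specific branch or environment) so only the intended workflows can assume the role.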

Managing Secrets in Terraform

The Problem

Terraform state stores resource attributes in plaintext. If you pass a database password as a variable, it ends up in the state file. Even with remote state, anyone with state access sees the secret.
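You can see the problem for yourself by grepping a pulled state file for attribute names that commonly hold secrets (the key list here is illustrative, not exhaustive):

```shell
# scan_state_for_secrets: list attribute names in a pulled state file that
# commonly hold secret values. Crude, but enough to prove the point.
scan_state_for_secrets() {
  local state_file=$1
  grep -oE '"(password|secret_string|private_key|client_secret)"' "$state_file" | sort -u
}

# Usage:
#   terraform state pull > current.json
#   scan_state_for_secrets current.json
```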

Solutions

# 1. Use a secrets manager -- Terraform reads, doesn't store
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/database/password"
}

resource "aws_db_instance" "main" {
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
# WARNING: The password STILL ends up in state. But it's read from the
# secrets manager rather than hardcoded or passed as a variable.

# 2. Use HashiCorp Vault provider
data "vault_generic_secret" "db" {
  path = "secret/data/prod/database"
}

resource "aws_db_instance" "main" {
  password = data.vault_generic_secret.db.data["password"]
}

# 3. Generate passwords and store them
resource "random_password" "db" {
  length  = 32
  special = true
}

resource "aws_secretsmanager_secret_version" "db" {
  secret_id     = aws_secretsmanager_secret.db.id
  secret_string = random_password.db.result
}

resource "aws_db_instance" "main" {
  password = random_password.db.result
}

The uncomfortable truth: Terraform state will contain secrets no matter what. Encrypt the state backend (S3 server-side encryption, GCS default encryption), restrict access with IAM, and enable versioning.
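For an S3 backend, that hardening is mostly configuration (a sketch reusing the bucket from earlier; the lock table name is a placeholder):

```hcl
terraform {
  backend "s3" {
    bucket  = "mycompany-terraform-state"
    key     = "prod/network/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true  # server-side encryption for the state object

    # DynamoDB locking also prevents two applies from racing on state
    dynamodb_table = "terraform-locks"
  }
}
```

Bucket versioning and the IAM policy restricting who can read the bucket are set on the bucket itself, outside this block.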

Mark sensitive values:

variable "db_password" {
  type      = string
  sensitive = true  # Hides from plan output, still in state
}

output "db_endpoint" {
  value     = aws_db_instance.main.endpoint
  sensitive = true
}

Performance Optimization for Large States

Symptoms

  • terraform plan takes 10+ minutes
  • terraform refresh is painfully slow

Diagnosis

# Time the plan
time terraform plan

# Count resources in state
terraform state list | wc -l

# Enable timing info
export TF_LOG=INFO
terraform plan 2>&1 | grep "after applying"

Solutions

# 1. Skip refresh during plan (use when you know state is fresh)
terraform plan -refresh=false

# 2. Target specific resources
terraform plan -target=module.compute

# 3. Parallelize (default is 10)
terraform apply -parallelism=30

Structural fixes:

  • Split large monolithic state into smaller, focused states (network, compute, database)
  • Use the terraform_remote_state data source to share outputs between states
  • Move stable, rarely-changing resources (VPC, IAM) into their own state

# In the compute state, read VPC outputs
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "mycompany-terraform-state"
    key    = "prod/network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "web" {
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_ids[0]
}

Debugging Slow Plans

# Profile with trace logging
export TF_LOG=TRACE
terraform plan 2>trace.log

# Look for slow or repeated API calls in the provider's HTTP logs
grep "HTTP/1.1" trace.log

# Check which resources take longest to refresh
grep "Refreshing state" trace.log

Common causes:

  • Too many resources in one state (split it)
  • Data sources querying large datasets (filter more aggressively)
  • Provider API rate limiting (increase parallelism, or reduce it if you're being throttled)
  • Distant backend (use a backend in the same region as your infrastructure)