Terraform Deep Dive - Footguns¶
Advanced mistakes that experienced operators still make. These go beyond the basics into the territory where real infrastructure gets damaged.
1. State File Stored in Git (with Secrets)¶
The state file contains every attribute of every resource Terraform manages. For an RDS instance, that includes the master password in plaintext. For an AWS access key resource, the secret key is in state. Committing terraform.tfstate means your secrets are in git history forever, even if you delete the file.
```shell
# Check if state was ever committed
git log --all --full-history -- '*.tfstate'
git log --all --full-history -- '*.tfstate.backup'
```
Fix: Use remote state from day one. Add to .gitignore:
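A typical .gitignore for a Terraform repository (the `*.tfvars` entry applies only if your variable files hold secrets):

```
# Local state and backups
*.tfstate
*.tfstate.*

# Provider binaries and module cache
.terraform/

# Variable files that may contain secrets
*.tfvars
```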
If state was already committed, rotate every secret it contained. git filter-branch or bfg can rewrite history, but assume the secrets are compromised.
2. No State Locking (Concurrent Applies Corrupt State)¶
Two engineers run terraform apply at the same time. Both read state serial 42. Both make changes. The second one to finish overwrites the first's changes. Resources managed by the first apply are now orphaned -- they exist in AWS but not in state.
Fix: Always configure state locking.
```hcl
# S3 backend: add dynamodb_table
terraform {
  backend "s3" {
    bucket         = "mycompany-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks" # THIS LINE
    encrypt        = true
  }
}
```
For Terraform Cloud and GCS, locking is built-in. There's no excuse for running without it.
3. terraform destroy Without Plan Review¶
You're cleaning up a dev environment. You type terraform destroy -auto-approve. But you're in the wrong directory. Or the wrong workspace. Production is gone.
Fix: Never use -auto-approve with destroy. Always run terraform plan -destroy first and read every line. Add prevent_destroy lifecycle rules on critical resources:
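For example, on a production database (resource names here are illustrative):

```hcl
resource "aws_db_instance" "prod" {
  # ... instance configuration ...

  lifecycle {
    # Any plan that would destroy this resource fails with an error
    prevent_destroy = true
  }
}
```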
In CI, require a separate approval step between plan-destroy and actual destroy.
4. -auto-approve in CI Without Plan Artifact¶
The plan at review time and the plan at apply time can differ (someone pushed a change, state drifted, provider updated). You approved one plan and executed a different one.
Fix: Save the plan and apply the saved plan:
```yaml
# On PR
- run: terraform plan -out=plan.tfplan

# On merge (apply the reviewed plan, not a new one)
- run: terraform apply plan.tfplan
```
This requires passing the plan artifact between jobs (upload/download action). It's worth the complexity.
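A sketch of the artifact handoff, assuming GitHub Actions with the standard upload/download actions:

```yaml
# plan job (on PR)
- run: terraform plan -out=plan.tfplan
- uses: actions/upload-artifact@v4
  with:
    name: tfplan
    path: plan.tfplan

# apply job (on merge, after approval)
- uses: actions/download-artifact@v4
  with:
    name: tfplan
- run: terraform apply plan.tfplan
```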
5. count Index Shift Destroys Resources¶
```hcl
variable "subnets" {
  default = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}

resource "aws_subnet" "main" {
  count      = length(var.subnets)
  cidr_block = var.subnets[count.index]
}
```
Remove the second subnet (10.0.2.0/24) and the indexes shift: Terraform now maps index 1 to 10.0.3.0/24. It will destroy the subnet at index 2 and replace the subnet at index 1, recreating 10.0.3.0/24 in its place. If anything is deployed in those subnets, it's disrupted.
Fix: Use for_each with stable keys:
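A minimal sketch of the same subnets keyed by the CIDR itself rather than by position:

```hcl
variable "subnets" {
  default = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}

resource "aws_subnet" "main" {
  for_each   = toset(var.subnets)
  cidr_block = each.value
}
```

Each subnet is now addressed as `aws_subnet.main["10.0.2.0/24"]` in state, so its identity doesn't depend on its neighbors.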
Removing an element only affects that specific element. No index shifting.
6. Hardcoded Values Instead of Variables¶
```hcl
# This works until you need a second environment
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "m5.xlarge"
  subnet_id     = "subnet-0abc123"

  tags = {
    Name = "prod-web-server"
  }
}
```
Now you need staging. You copy the file, change the hardcoded values, and maintain two diverging copies. When you fix a bug in prod, you forget to fix it in staging.
Fix: Parameterize everything that varies between environments:
```hcl
resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = var.instance_type
  subnet_id     = var.subnet_id

  tags = {
    Name = "${var.environment}-web-server"
  }
}
```
7. Not Pinning Provider Versions¶
A new provider version ships with a breaking change. terraform init on a new machine pulls the latest. Your plan now shows 47 unexpected changes.
Fix: Pin and lock:
```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.30" # >= 5.30, < 6.0
    }
  }
}
```
Commit .terraform.lock.hcl to git. Update providers intentionally with terraform init -upgrade.
8. ignore_changes Hiding Real Drift¶
```hcl
resource "aws_autoscaling_group" "web" {
  desired_capacity = 3

  lifecycle {
    ignore_changes = [desired_capacity]
  }
}
```
This seems reasonable -- auto-scaling changes the count, and you don't want Terraform resetting it. But now if someone manually sets desired_capacity to 1 (or 100), Terraform won't tell you. You've created a blind spot.
Fix: Be specific about what you ignore and why. Document it. Consider whether a separate monitoring check should alert on the ignored attribute. Never use ignore_changes = all -- it makes the entire resource invisible to drift detection.
9. Data Sources Creating Dependency on Running Infrastructure¶
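The fragile pattern looks like this (tag value illustrative):

```hcl
data "aws_instance" "web" {
  filter {
    name   = "tag:Name"
    values = ["web-server"]
  }
}
```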
A data source like this fails at plan time whenever nothing matches. If web-server doesn't exist yet, terraform plan fails. If it was terminated, plan fails. Data sources assume the infrastructure exists, which creates a runtime dependency that can break your pipeline.
Fix: For resources you manage, reference them directly (aws_instance.web.id). Only use data sources for infrastructure managed by another team or state file. Handle the "not found" case:
```hcl
data "aws_instances" "web" { # Note: plural, returns a list
  filter {
    name   = "tag:Name"
    values = ["web-server"]
  }
}

locals {
  web_instance_id = length(data.aws_instances.web.ids) > 0 ? data.aws_instances.web.ids[0] : null
}
```
10. Large Monolithic State (Blast Radius)¶
You have 500 resources in one state file. Every terraform plan takes 8 minutes because it refreshes everything. Every apply risks 500 resources. A bad module change can cascade across the entire infrastructure.
Fix: Split state by concern:
- network/ -- VPC, subnets, route tables (changes rarely)
- security/ -- IAM roles, policies, security groups
- compute/ -- EC2, ASGs, load balancers
- database/ -- RDS, ElastiCache, DynamoDB
- monitoring/ -- CloudWatch alarms, dashboards
Use terraform_remote_state data source to pass outputs between states. Each state file has its own blast radius.
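A sketch of the handoff (bucket, key, and output names are illustrative; the network state must declare a matching output):

```hcl
# compute/main.tf -- read outputs from the network state
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "mycompany-state"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "web" {
  subnet_id = data.terraform_remote_state.network.outputs.subnet_id
}
```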
11. Workspace State in the Same Backend (No Isolation)¶
With workspaces, dev and prod state sit in the same S3 bucket under env:/dev/ and env:/prod/. Same IAM permissions access both. A compromised CI job for dev can read prod state (which contains prod secrets).
Fix: Use separate backends per environment:
```hcl
# environments/prod/backend.tf
terraform {
  backend "s3" {
    bucket = "mycompany-state-prod" # Separate bucket
    key    = "terraform.tfstate"
    region = "us-east-1"
  }
}
```
Different buckets, different IAM policies, different blast radius.
12. Default Tags Not Set (Untagged Resources)¶
You create 200 resources. None have cost-allocation tags. Your CFO asks which team is responsible for the $47,000 bill spike. You can't tell.
Fix: Use default tags at the provider level:
```hcl
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      Environment = var.environment
      Team        = var.team
      Project     = var.project
      ManagedBy   = "terraform"
      CostCenter  = var.cost_center
    }
  }
}
```
Every resource created through this provider gets these tags automatically. Individual resources can override them.
13. Sensitive Values in Plan Output¶
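The anti-pattern is an unmarked output of a secret (resource name illustrative):

```hcl
output "db_password" {
  value = aws_db_instance.main.password
}
```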
An output like this prints the password in terraform output, in plan output, and in CI logs.
Fix: Mark outputs as sensitive:
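Same output with the flag set (resource name illustrative):

```hcl
output "db_password" {
  value     = aws_db_instance.main.password
  sensitive = true # redacted in console and plan output
}
```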
But remember: sensitive = true only hides from console output. The value is still in state. Anyone with state access can run terraform output -json and see it.
14. Provisioners That Make Resources Non-Idempotent¶
```hcl
resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  provisioner "remote-exec" {
    inline = [
      "curl -fsSL https://get.docker.com | sh",
      "docker run -d -p 80:80 nginx",
    ]
  }
}
```
Provisioners only run on creation. If the instance is stopped and started, Docker and nginx are gone. If you taint and recreate, you depend on an external URL being available. The resource isn't truly declarative.
Fix: Use user_data (cloud-init) for bootstrap, Packer for baked AMIs, or configuration management (Ansible) for ongoing state. About the only acceptable use of local-exec is calling an external API that has no Terraform provider.
15. Terraform State as a Source of Truth for Application Config¶
```hcl
# Anti-pattern: application reads config from Terraform output
output "api_endpoint" {
  value = aws_lb.api.dns_name
}

# App deployment script: terraform output -json | jq .api_endpoint
```
This couples your application deployment to Terraform state access. If state is locked (someone running apply), your deployment blocks. If state is corrupted, your deployment breaks.
Fix: Write outputs to a proper configuration store (SSM Parameter Store, Consul, environment variables in your deployment platform). Terraform creates the resource and writes the endpoint to SSM. The application reads from SSM.
```hcl
resource "aws_ssm_parameter" "api_endpoint" {
  name  = "/${var.environment}/api/endpoint"
  type  = "String"
  value = aws_lb.api.dns_name
}
```