

Terraform Deep Dive - Primer

Beyond the basics. This covers the internals, patterns, and advanced features that separate operators who use Terraform from operators who understand it.

State Management

State is the single most critical concept in Terraform. Every resource Terraform manages is tracked in state. Lose the state, and Terraform loses its knowledge of what it manages.

Remote State

Local state is a non-starter for teams. Remote state backends store the state file centrally so everyone reads the same truth.

S3 + DynamoDB (AWS standard):

terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

The DynamoDB table provides state locking. Without it, two terraform apply runs can execute concurrently and corrupt state. The table needs exactly one attribute: a partition key named LockID of type String.

Gotcha: If you name the partition key anything else (like ID or lock_id), Terraform silently fails to acquire locks and you get no locking protection.

# Create the lock table
aws dynamodb create-table \
  --table-name terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

GCS (GCP standard):

terraform {
  backend "gcs" {
    bucket = "mycompany-terraform-state"
    prefix = "prod/network"
  }
}

GCS locking is built-in -- no extra table needed. State is encrypted at rest by default.

Terraform Cloud / HCP Terraform:

terraform {
  cloud {
    organization = "mycompany"
    workspaces {
      name = "prod-network"
    }
  }
}

This gives you state management, locking, run history, cost estimation, and policy enforcement in one package. The tradeoff is vendor lock-in and network dependency.

State Locking

When someone runs terraform plan or apply, Terraform acquires a lock. If another operator tries to run at the same time:

Error: Error acquiring the state lock

Lock Info:
  ID:        f8b2e1a3-7c4d-9e6f-0a1b-2c3d4e5f6a7b
  Path:      s3://mycompany-terraform-state/prod/network/terraform.tfstate
  Operation: OperationTypeApply
  Who:       alice@workstation
  Version:   1.7.0
  Created:   2026-03-15 14:32:01.234567 +0000 UTC

If a lock is stuck (operator's laptop crashed mid-apply):

# Force unlock -- ONLY after confirming no apply is actually running
terraform force-unlock f8b2e1a3-7c4d-9e6f-0a1b-2c3d4e5f6a7b

Force-unlock is dangerous. If an apply is actually still in progress and you remove the lock, a second apply can start alongside it and corrupt state. Always verify the lock holder is truly gone before unlocking.

State Operations

These commands manipulate state directly. Use them carefully.

# Show the full state of a resource
terraform state show aws_instance.web

# List all resources in state
terraform state list

# Move a resource (rename in state without destroying/recreating)
terraform state mv aws_instance.web aws_instance.application

# Move a resource into a module
terraform state mv aws_instance.web module.compute.aws_instance.web

# Remove a resource from state (Terraform "forgets" it, infrastructure stays)
terraform state rm aws_instance.legacy

# Pull the entire state to stdout (useful for backups)
terraform state pull > backup.tfstate

# Push a state file (dangerous -- overwrites remote state)
terraform state push corrected.tfstate

# Import existing infrastructure into state
terraform import aws_instance.web i-0abc123def456789a

Critical rule: terraform state rm does NOT destroy infrastructure. It removes Terraform's knowledge of the resource. The resource continues to exist but becomes unmanaged. This is useful when migrating resources between state files.
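One sketch of that migration, assuming the matching resource block already exists in the destination configuration (the instance ID is hypothetical):

```hcl
# In the old root module, forget the resource -- the instance keeps running:
#   terraform state rm aws_instance.legacy

# In the new root module, adopt it with a declarative import (Terraform 1.5+):
import {
  to = aws_instance.legacy
  id = "i-0abc123def456789a"  # hypothetical instance ID
}
```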

State File Structure

The state file is JSON. Understanding the structure helps when debugging:

{
  "version": 4,
  "terraform_version": "1.7.0",
  "serial": 42,
  "lineage": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "outputs": {},
  "resources": [
    {
      "mode": "managed",
      "type": "aws_instance",
      "name": "web",
      "provider": "provider[\"registry.terraform.io/hashicorp/aws\"]",
      "instances": [
        {
          "attributes": {
            "id": "i-0abc123def456789a",
            "ami": "ami-0c55b159cbfafe1f0",
            "instance_type": "t3.micro"
          }
        }
      ]
    }
  ]
}

The serial increments with every state change. The lineage is a UUID assigned at creation -- it prevents accidentally pushing state from one environment to another.

Workspaces

Workspaces let you maintain multiple state files within the same configuration. Each workspace has its own state.

# List workspaces
terraform workspace list

# Create and switch
terraform workspace new staging
terraform workspace select production

# Show current
terraform workspace show

# Delete (must switch away first)
terraform workspace select default
terraform workspace delete staging

Using the workspace name in config:

resource "aws_instance" "web" {
  instance_type = terraform.workspace == "prod" ? "m5.large" : "t3.micro"

  tags = {
    Environment = terraform.workspace
  }
}

# Dynamic backend key per workspace (not supported in backend blocks --
# use partial configuration instead)

Workspace gotcha: All workspaces share the same backend. In S3, they're stored as env:/staging/terraform.tfstate, env:/production/terraform.tfstate. This means a single set of IAM permissions controls access to all environments. For real isolation, use separate state files with separate backends per environment.
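The partial configuration mentioned in the comment above looks like this -- an empty backend block in code, with the environment-specific settings supplied at init time (file and bucket names here are hypothetical):

```hcl
# backend.tf -- the backend block stays empty; settings arrive at init time
terraform {
  backend "s3" {}
}

# env/prod.s3.tfbackend (hypothetical file and values):
#   bucket         = "mycompany-terraform-state-prod"
#   key            = "network/terraform.tfstate"
#   region         = "us-east-1"
#   encrypt        = true
#   dynamodb_table = "terraform-locks-prod"
#
# Then: terraform init -backend-config=env/prod.s3.tfbackend
```

Each environment gets its own .tfbackend file, so prod and staging state can live in separate buckets with separate IAM policies.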

Modules

Modules are Terraform's primary code reuse mechanism. A module is a directory of .tf files.

Module Structure

modules/
  vpc/
    main.tf          # Resources
    variables.tf     # Input variables
    outputs.tf       # Output values
    versions.tf      # Provider requirements
    README.md        # Documentation

Input Variables

# modules/vpc/variables.tf
variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string
  default     = "10.0.0.0/16"

  validation {
    condition     = can(cidrnetmask(var.vpc_cidr))
    error_message = "Must be a valid CIDR block."
  }
}

variable "availability_zones" {
  description = "List of AZs"
  type        = list(string)
}

variable "environment" {
  description = "Environment name"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Must be dev, staging, or prod."
  }
}

Output Values

# modules/vpc/outputs.tf
output "vpc_id" {
  description = "ID of the created VPC"
  value       = aws_vpc.main.id
}

output "private_subnet_ids" {
  description = "IDs of private subnets"
  value       = aws_subnet.private[*].id
}

Calling Modules

module "vpc" {
  source = "./modules/vpc"

  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
  environment        = "prod"
}

# Use outputs from the module
resource "aws_instance" "web" {
  subnet_id = module.vpc.private_subnet_ids[0]
}

Module Sources

Modules can come from many places:

# Local path
module "vpc" {
  source = "./modules/vpc"
}

# Terraform Registry
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.5.0"
}

# GitHub
module "vpc" {
  source = "github.com/mycompany/terraform-modules//vpc?ref=v2.1.0"
}

# S3
module "vpc" {
  source = "s3::https://s3-us-east-1.amazonaws.com/mycompany-modules/vpc.zip"
}

# Generic git
module "vpc" {
  source = "git::ssh://git@github.com/mycompany/terraform-modules.git//vpc?ref=v2.1.0"
}

Always pin module versions. An unpinned registry module pulls the latest, which may contain breaking changes.

Module Composition

Modules call other modules. A root module composes child modules:

# Root module composes infrastructure
module "network" {
  source      = "./modules/network"
  environment = var.environment
}

module "compute" {
  source     = "./modules/compute"
  vpc_id     = module.network.vpc_id
  subnet_ids = module.network.private_subnet_ids
}

module "database" {
  source     = "./modules/database"
  vpc_id     = module.network.vpc_id
  subnet_ids = module.network.database_subnet_ids
}

Keep modules small and focused. A module that creates a VPC, subnets, route tables, NAT gateways, security groups, EC2 instances, and load balancers is too big. Split it.

Providers

Provider Aliases

When you need multiple configurations of the same provider:

provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  alias  = "west"
  region = "us-west-2"
}

# Use the aliased provider
resource "aws_instance" "west_web" {
  provider      = aws.west
  ami           = "ami-0abc123def456789a"
  instance_type = "t3.micro"
}

# Pass to modules
module "dr_vpc" {
  source = "./modules/vpc"
  providers = {
    aws = aws.west
  }
}

Provider Version Constraints

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"    # >= 5.0, < 6.0
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = ">= 2.25.0" # Any version >= 2.25.0
    }
  }
}

The .terraform.lock.hcl file records exact versions and hashes. Commit this file. It ensures every operator uses the same provider versions.

# Update lock file after changing version constraints
terraform init -upgrade

Data Sources

Data sources read information from infrastructure without managing it:

# Look up the latest Amazon Linux 2023 AMI
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}

# Look up existing VPC by tag
data "aws_vpc" "existing" {
  filter {
    name   = "tag:Name"
    values = ["production-vpc"]
  }
}

# Use data source values
resource "aws_instance" "web" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.micro"
}

resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = data.aws_vpc.existing.id
}

Data sources are evaluated during plan, so they create a dependency on the live infrastructure. If the queried resource doesn't exist, the plan fails.

Lifecycle Rules

Lifecycle meta-arguments control how Terraform handles resource changes:

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = var.instance_type

  lifecycle {
    # Create the new one before destroying the old one
    # Essential for zero-downtime replacements
    create_before_destroy = true

    # Prevent accidental deletion (plan will fail if it tries)
    prevent_destroy = true

    # Ignore changes made outside Terraform
    # Useful when auto-scaling or another process modifies the resource
    ignore_changes = [
      tags["LastModified"],
      instance_type,
    ]

    # Force replacement when another resource changes
    replace_triggered_by = [
      aws_ami_copy.encrypted.id,
    ]
  }
}

create_before_destroy: Critical for load-balanced instances. Without it, Terraform destroys the old resource before creating the new one, causing downtime.

prevent_destroy: Safety net for databases, S3 buckets with data, and anything you never want accidentally destroyed. Note: terraform state rm bypasses this.

ignore_changes: Use sparingly. It hides drift, which means your config no longer represents reality. Legitimate uses: auto-scaling group sizes, tags set by external automation.
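A sketch of the auto-scaling case, assuming a scaling policy adjusts desired_capacity outside Terraform (resource names are illustrative):

```hcl
resource "aws_autoscaling_group" "web" {
  name                = "web-asg"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }

  lifecycle {
    # Scaling policies own this value after creation; without this,
    # every plan would try to reset capacity back to 2
    ignore_changes = [desired_capacity]
  }
}
```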

Dynamic Blocks

When resource blocks contain repeating nested blocks:

variable "ingress_rules" {
  type = list(object({
    port        = number
    protocol    = string
    cidr_blocks = list(string)
    description = string
  }))
  default = [
    { port = 80,  protocol = "tcp", cidr_blocks = ["0.0.0.0/0"], description = "HTTP" },
    { port = 443, protocol = "tcp", cidr_blocks = ["0.0.0.0/0"], description = "HTTPS" },
    { port = 22,  protocol = "tcp", cidr_blocks = ["10.0.0.0/8"], description = "SSH internal" },
  ]
}

resource "aws_security_group" "web" {
  name   = "web-sg"
  vpc_id = aws_vpc.main.id

  dynamic "ingress" {
    for_each = var.ingress_rules
    content {
      from_port   = ingress.value.port
      to_port     = ingress.value.port
      protocol    = ingress.value.protocol
      cidr_blocks = ingress.value.cidr_blocks
      description = ingress.value.description
    }
  }
}

Don't overuse dynamic blocks. If you have one or two static blocks, write them out. Dynamic blocks make the code harder to read.

for_each vs count

count

resource "aws_instance" "web" {
  count         = 3
  ami           = var.ami_id
  instance_type = "t3.micro"

  tags = {
    Name = "web-${count.index}"
  }
}
# Creates: aws_instance.web[0], aws_instance.web[1], aws_instance.web[2]

for_each

resource "aws_instance" "web" {
  for_each      = toset(["app1", "app2", "app3"])
  ami           = var.ami_id
  instance_type = "t3.micro"

  tags = {
    Name = each.key
  }
}
# Creates: aws_instance.web["app1"], aws_instance.web["app2"], aws_instance.web["app3"]

The critical difference: With count, removing item at index 1 shifts all subsequent indexes, causing Terraform to destroy and recreate resources 2+ with new indexes. With for_each, removing "app2" only affects that one resource. Always prefer for_each for resources that may change independently.

Remember: count = array (index-based, fragile), for_each = map (key-based, stable). If you would not want a production database destroyed because you removed a different database from a list, use for_each.
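One place count remains idiomatic is conditional creation -- a resource toggled on or off by a flag (the variable name here is illustrative):

```hcl
resource "aws_eip" "nat" {
  count  = var.enable_nat ? 1 : 0
  domain = "vpc"
}

# Reference with an index or splat so the zero-instance case still works:
#   aws_eip.nat[0].id       (errors when disabled)
#   one(aws_eip.nat[*].id)  (null when disabled)
```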

for_each with maps

variable "instances" {
  type = map(object({
    instance_type = string
    ami           = string
  }))
  default = {
    web = { instance_type = "t3.micro", ami = "ami-abc123" }
    api = { instance_type = "t3.small", ami = "ami-def456" }
  }
}

resource "aws_instance" "this" {
  for_each      = var.instances
  ami           = each.value.ami
  instance_type = each.value.instance_type

  tags = {
    Name = each.key
  }
}

Custom Conditions (Terraform 1.2+)

Preconditions and postconditions add validation at plan or apply time:

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = var.instance_type

  lifecycle {
    precondition {
      condition     = data.aws_ami.selected.architecture == "x86_64"
      error_message = "AMI must be x86_64 architecture."
    }

    postcondition {
      condition     = self.public_ip != ""
      error_message = "Instance must have a public IP assigned."
    }
  }
}

Preconditions check before creation. Postconditions check after. Use them to catch misconfigurations early rather than debugging resource failures.

Moved Blocks (Terraform 1.1+)

When you refactor code and rename resources:

# Tell Terraform the resource moved, don't destroy + recreate
moved {
  from = aws_instance.web
  to   = aws_instance.application
}

# Move into a module
moved {
  from = aws_instance.web
  to   = module.compute.aws_instance.web
}

Moved blocks are declarative -- they live in your code and tell Terraform about the rename. After applying, you can optionally remove them. This is safer than terraform state mv because it's reviewed in PRs and applied as part of normal workflow.
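Moved blocks also handle index changes, which makes them useful when migrating from count to for_each (keys here are illustrative):

```hcl
# After changing count = 2 to for_each = toset(["app1", "app2"]):
moved {
  from = aws_instance.web[0]
  to   = aws_instance.web["app1"]
}

moved {
  from = aws_instance.web[1]
  to   = aws_instance.web["app2"]
}
```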

Import Blocks (Terraform 1.5+)

Declarative imports, replacing the imperative terraform import command:

import {
  to = aws_instance.web
  id = "i-0abc123def456789a"
}

resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"
}

Run terraform plan and Terraform shows what the import will look like. Adjust your resource config until the plan shows no changes after import. Then apply.

Generate config from an import (Terraform 1.5+):

# Generate HCL config for the imported resource
terraform plan -generate-config-out=generated.tf

This writes a .tf file with the resource configuration matching the imported state. Review it, clean it up, and integrate it into your codebase.

Testing Framework (Terraform 1.6+)

Native testing with .tftest.hcl files:

# tests/vpc.tftest.hcl

variables {
  vpc_cidr    = "10.0.0.0/16"
  environment = "test"
}

run "create_vpc" {
  command = apply

  assert {
    condition     = aws_vpc.main.cidr_block == "10.0.0.0/16"
    error_message = "VPC CIDR does not match."
  }

  assert {
    condition     = length(aws_subnet.private) == 3
    error_message = "Expected 3 private subnets."
  }
}

run "verify_tags" {
  command = plan

  assert {
    condition     = aws_vpc.main.tags["Environment"] == "test"
    error_message = "Environment tag not set correctly."
  }
}

# Run tests
terraform test

# Run specific test file
terraform test -filter=tests/vpc.tftest.hcl

# Verbose output
terraform test -verbose

Tests can use command = plan (no real resources created) or command = apply (creates real infrastructure, then destroys it after the test). Plan-only tests are faster and free. Apply tests catch issues that only appear at apply time.

Plan Files and Surgical Applies

# Save a plan to a file
terraform plan -out=plan.tfplan

# Apply only that exact plan (no drift between plan and apply)
terraform apply plan.tfplan

# Target a specific resource (surgical apply)
terraform plan -target=aws_instance.web
terraform apply -target=aws_instance.web

# Destroy a specific resource
terraform destroy -target=aws_instance.legacy

# Show plan in JSON (for programmatic analysis)
terraform show -json plan.tfplan

Plan files are binary and may contain secrets (they embed the full state at plan time). Don't store them in git or artifact stores without encryption.
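The JSON form is what you would feed to policy or review tooling. A sketch of pulling the planned actions out with jq, simulated here on a literal payload because the real input would come from terraform show -json plan.tfplan:

```shell
# Each entry in .resource_changes records an address and its planned actions
echo '{"resource_changes":[{"address":"aws_instance.web","change":{"actions":["create"]}}]}' \
  | jq -r '.resource_changes[] | "\(.change.actions | join(",")) \(.address)"'
# In practice: terraform show -json plan.tfplan | jq -r '...'
```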

-target is an escape hatch, not a workflow. If you're using it regularly, your state is too big or your dependencies are wrong. Split into smaller states.

Locals

Locals are computed values within a module. Use them to avoid repeating expressions:

locals {
  common_tags = {
    Environment = var.environment
    Project     = var.project_name
    ManagedBy   = "terraform"
    Team        = var.team
  }

  name_prefix = "${var.project_name}-${var.environment}"

  # Conditional logic
  is_prod = var.environment == "prod"

  # Transform data
  subnet_cidrs = [for i, az in var.availability_zones :
    cidrsubnet(var.vpc_cidr, 8, i)
  ]
}

resource "aws_instance" "web" {
  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-web"
  })
}

Provisioners (and Why to Avoid Them)

Provisioners run scripts on resources after creation:

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  # Runs on the remote machine via SSH
  provisioner "remote-exec" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y nginx",
    ]

    connection {
      type        = "ssh"
      user        = "ubuntu"
      private_key = file("~/.ssh/id_rsa")
      host        = self.public_ip
    }
  }

  # Runs on the machine running Terraform
  provisioner "local-exec" {
    command = "echo ${self.public_ip} >> inventory.txt"
  }
}

Why to avoid them: Provisioners make resources not truly declarative. If the provisioner fails, the resource is tainted. If you change the provisioner script, Terraform doesn't re-run it (it only runs on creation). Use Packer to bake AMIs, cloud-init for bootstrap scripts, or Ansible for configuration management. The only legitimate use of local-exec is triggering external systems (like updating a DNS record via an API that has no Terraform provider).
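A sketch of the cloud-init alternative for the nginx example above (user_data_replace_on_change assumes a reasonably recent AWS provider):

```hcl
resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  # cloud-init runs this on first boot -- no SSH access, no provisioner
  user_data = <<-EOF
    #!/bin/bash
    apt-get update
    apt-get install -y nginx
  EOF

  # Replace the instance when the script changes, keeping bootstrap and
  # instance in sync (without this, user_data edits apply in place and
  # the script never re-runs)
  user_data_replace_on_change = true
}
```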

