Portal | Level: L2: Operations | Topics: Terraform Deep Dive, Terraform | Domain: DevOps & Tooling
Terraform Deep Dive - Primer¶
Beyond the basics. This covers the internals, patterns, and advanced features that separate operators who use Terraform from operators who understand it.
State Management¶
State is the single most critical concept in Terraform. Every resource Terraform manages is tracked in state. Lose the state, and Terraform loses its knowledge of what it manages.
Remote State¶
Local state is a non-starter for teams. Remote state backends store the state file centrally so everyone reads the same truth.
S3 + DynamoDB (AWS standard):
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
Gotcha: The DynamoDB table for state locking needs exactly one attribute: a partition key named LockID of type String. If you name it anything else (like ID or lock_id), Terraform silently fails to acquire locks and you get no locking protection.
The DynamoDB table provides state locking. Without it, two terraform apply runs can execute concurrently and corrupt state.
# Create the lock table
aws dynamodb create-table \
  --table-name terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
GCS (GCP standard):
GCS locking is built-in -- no extra table needed. State is encrypted at rest by default.
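A minimal GCS backend block looks like this (the bucket name and prefix are illustrative):

```hcl
terraform {
  backend "gcs" {
    bucket = "mycompany-terraform-state"
    prefix = "prod/network"
  }
}
```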
Terraform Cloud / HCP Terraform:
This gives you state management, locking, run history, cost estimation, and policy enforcement in one package. The tradeoff is vendor lock-in and network dependency.
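A minimal HCP Terraform configuration is a `cloud` block in place of a backend (the organization and workspace names here are illustrative):

```hcl
terraform {
  cloud {
    organization = "mycompany"

    workspaces {
      name = "prod-network"
    }
  }
}
```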
State Locking¶
When someone runs terraform plan or apply, Terraform acquires a lock. If another operator tries to run at the same time:
Error: Error acquiring the state lock

Lock Info:
  ID:        f8b2e1a3-7c4d-9e6f-0a1b-2c3d4e5f6a7b
  Path:      s3://mycompany-terraform-state/prod/network/terraform.tfstate
  Operation: OperationTypeApply
  Who:       alice@workstation
  Version:   1.7.0
  Created:   2026-03-15 14:32:01.234567 +0000 UTC
If a lock is stuck (operator's laptop crashed mid-apply):
# Force unlock -- ONLY after confirming no apply is actually running
terraform force-unlock f8b2e1a3-7c4d-9e6f-0a1b-2c3d4e5f6a7b
Force-unlock is dangerous. If an apply is still running and you unlock, a concurrent apply can now run. Always verify the lock holder is actually gone.
State Operations¶
These commands manipulate state directly. Use them carefully.
# Show the full state of a resource
terraform state show aws_instance.web
# List all resources in state
terraform state list
# Move a resource (rename in state without destroying/recreating)
terraform state mv aws_instance.web aws_instance.application
# Move a resource into a module
terraform state mv aws_instance.web module.compute.aws_instance.web
# Remove a resource from state (Terraform "forgets" it, infrastructure stays)
terraform state rm aws_instance.legacy
# Pull the entire state to stdout (useful for backups)
terraform state pull > backup.tfstate
# Push a state file (dangerous -- overwrites remote state)
terraform state push corrected.tfstate
# Import existing infrastructure into state
terraform import aws_instance.web i-0abc123def456789a
Critical rule: terraform state rm does NOT destroy infrastructure. It removes Terraform's knowledge of the resource. The resource continues to exist but becomes unmanaged. This is useful when migrating resources between state files.
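One way to sketch such a migration is with local copies of both state files, using state mv's legacy -state and -state-out options (directory and file names here are illustrative):

```shell
# Pull both states to local files (each directory has its own backend configured)
(cd network && terraform state pull) > network.tfstate
(cd compute && terraform state pull) > compute.tfstate

# Move the resource from one state file into the other
terraform state mv \
  -state=network.tfstate \
  -state-out=compute.tfstate \
  aws_instance.web aws_instance.web

# Push each updated state back from its own directory, then
# add the matching resource block to the compute configuration
```

After the move, run terraform plan in both directories and confirm neither wants to create or destroy the instance.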
State File Structure¶
The state file is JSON. Understanding the structure helps when debugging:
{
  "version": 4,
  "terraform_version": "1.7.0",
  "serial": 42,
  "lineage": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "outputs": {},
  "resources": [
    {
      "mode": "managed",
      "type": "aws_instance",
      "name": "web",
      "provider": "provider[\"registry.terraform.io/hashicorp/aws\"]",
      "instances": [
        {
          "attributes": {
            "id": "i-0abc123def456789a",
            "ami": "ami-0c55b159cbfafe1f0",
            "instance_type": "t3.micro"
          }
        }
      ]
    }
  ]
}
The serial increments with every state change. The lineage is a UUID assigned at creation -- it prevents accidentally pushing state from one environment to another.
Workspaces¶
Workspaces let you maintain multiple state files within the same configuration. Each workspace has its own state.
# List workspaces
terraform workspace list
# Create and switch
terraform workspace new staging
terraform workspace select production
# Show current
terraform workspace show
# Delete (must switch away first)
terraform workspace select default
terraform workspace delete staging
Using the workspace name in config:
resource "aws_instance" "web" {
  instance_type = terraform.workspace == "prod" ? "m5.large" : "t3.micro"
  tags = {
    Environment = terraform.workspace
  }
}
Note: backend blocks cannot interpolate values, so you cannot build a dynamic backend key from the workspace name. Use partial configuration instead.
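Partial configuration means leaving attributes like key out of the backend block and supplying them at init time; a sketch (key paths are illustrative):

```shell
# backend "s3" block omits "key"; supply it per environment at init time.
# -reconfigure is needed when switching an already-initialized directory.
terraform init -reconfigure \
  -backend-config="key=prod/network/terraform.tfstate"
```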
Workspace gotcha: All workspaces share the same backend. In S3, they're stored as env:/staging/terraform.tfstate, env:/production/terraform.tfstate. This means a single set of IAM permissions controls access to all environments. For real isolation, use separate state files with separate backends per environment.
Modules¶
Modules are Terraform's primary code reuse mechanism. A module is a directory of .tf files.
Module Structure¶
modules/
  vpc/
    main.tf       # Resources
    variables.tf  # Input variables
    outputs.tf    # Output values
    versions.tf   # Provider requirements
    README.md     # Documentation
Input Variables¶
# modules/vpc/variables.tf
variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string
  default     = "10.0.0.0/16"

  validation {
    condition     = can(cidrnetmask(var.vpc_cidr))
    error_message = "Must be a valid CIDR block."
  }
}

variable "availability_zones" {
  description = "List of AZs"
  type        = list(string)
}

variable "environment" {
  description = "Environment name"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Must be dev, staging, or prod."
  }
}
Output Values¶
# modules/vpc/outputs.tf
output "vpc_id" {
  description = "ID of the created VPC"
  value       = aws_vpc.main.id
}

output "private_subnet_ids" {
  description = "IDs of private subnets"
  value       = aws_subnet.private[*].id
}
Calling Modules¶
module "vpc" {
  source             = "./modules/vpc"
  vpc_cidr           = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
  environment        = "prod"
}

# Use outputs from the module
resource "aws_instance" "web" {
  subnet_id = module.vpc.private_subnet_ids[0]
}
Module Sources¶
Modules can come from many places:
# Local path
module "vpc" {
  source = "./modules/vpc"
}

# Terraform Registry
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.5.0"
}

# GitHub
module "vpc" {
  source = "github.com/mycompany/terraform-modules//vpc?ref=v2.1.0"
}

# S3
module "vpc" {
  source = "s3::https://s3-us-east-1.amazonaws.com/mycompany-modules/vpc.zip"
}

# Generic git
module "vpc" {
  source = "git::ssh://git@github.com/mycompany/terraform-modules.git//vpc?ref=v2.1.0"
}
Always pin module versions. An unpinned registry module pulls the latest, which may contain breaking changes.
Module Composition¶
Modules call other modules. A root module composes child modules:
# Root module composes infrastructure
module "network" {
  source      = "./modules/network"
  environment = var.environment
}

module "compute" {
  source     = "./modules/compute"
  vpc_id     = module.network.vpc_id
  subnet_ids = module.network.private_subnet_ids
}

module "database" {
  source     = "./modules/database"
  vpc_id     = module.network.vpc_id
  subnet_ids = module.network.database_subnet_ids
}
Keep modules small and focused. A module that creates a VPC, subnets, route tables, NAT gateways, security groups, EC2 instances, and load balancers is too big. Split it.
Providers¶
Provider Aliases¶
When you need multiple configurations of the same provider:
provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  alias  = "west"
  region = "us-west-2"
}

# Use the aliased provider
resource "aws_instance" "west_web" {
  provider      = aws.west
  ami           = "ami-0abc123def456789a"
  instance_type = "t3.micro"
}

# Pass to modules
module "dr_vpc" {
  source = "./modules/vpc"
  providers = {
    aws = aws.west
  }
}
Provider Version Constraints¶
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # >= 5.0, < 6.0
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = ">= 2.25.0" # Any version >= 2.25.0
    }
  }
}
The .terraform.lock.hcl file records exact versions and hashes. Commit this file. It ensures every operator uses the same provider versions.
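By default the lock file records hashes only for the platform that ran terraform init; it can be extended so CI on a different OS verifies the same hashes:

```shell
# Record provider hashes for multiple platforms in .terraform.lock.hcl
terraform providers lock \
  -platform=linux_amd64 \
  -platform=darwin_arm64
```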
Data Sources¶
Data sources read information from infrastructure without managing it:
# Look up the latest Amazon Linux 2023 AMI
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }
}
# Look up existing VPC by tag
data "aws_vpc" "existing" {
  filter {
    name   = "tag:Name"
    values = ["production-vpc"]
  }
}

# Look up a subnet inside that VPC by tag (tag value is illustrative)
data "aws_subnet" "existing" {
  vpc_id = data.aws_vpc.existing.id

  filter {
    name   = "tag:Name"
    values = ["production-private-a"]
  }
}
# Use data source values
resource "aws_instance" "web" {
  ami       = data.aws_ami.amazon_linux.id
  subnet_id = data.aws_subnet.existing.id
}
Data sources are evaluated during plan, so they create a dependency on the live infrastructure. If the queried resource doesn't exist, the plan fails.
Lifecycle Rules¶
Lifecycle meta-arguments control how Terraform handles resource changes:
resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = var.instance_type

  lifecycle {
    # Create the new one before destroying the old one
    # Essential for zero-downtime replacements
    create_before_destroy = true

    # Prevent accidental deletion (plan will fail if it tries)
    prevent_destroy = true

    # Ignore changes made outside Terraform
    # Useful when auto-scaling or another process modifies the resource
    ignore_changes = [
      tags["LastModified"],
      instance_type,
    ]

    # Force replacement when another resource changes
    replace_triggered_by = [
      aws_ami_copy.encrypted.id,
    ]
  }
}
create_before_destroy: Critical for load-balanced instances. Without it, Terraform destroys the old resource before creating the new one, causing downtime.
prevent_destroy: Safety net for databases, S3 buckets with data, and anything you never want accidentally destroyed. Note: terraform state rm bypasses this.
ignore_changes: Use sparingly. It hides drift, which means your config no longer represents reality. Legitimate uses: auto-scaling group sizes, tags set by external automation.
Dynamic Blocks¶
When resource blocks contain repeating nested blocks:
variable "ingress_rules" {
  type = list(object({
    port        = number
    protocol    = string
    cidr_blocks = list(string)
    description = string
  }))
  default = [
    { port = 80, protocol = "tcp", cidr_blocks = ["0.0.0.0/0"], description = "HTTP" },
    { port = 443, protocol = "tcp", cidr_blocks = ["0.0.0.0/0"], description = "HTTPS" },
    { port = 22, protocol = "tcp", cidr_blocks = ["10.0.0.0/8"], description = "SSH internal" },
  ]
}

resource "aws_security_group" "web" {
  name   = "web-sg"
  vpc_id = aws_vpc.main.id

  dynamic "ingress" {
    for_each = var.ingress_rules
    content {
      from_port   = ingress.value.port
      to_port     = ingress.value.port
      protocol    = ingress.value.protocol
      cidr_blocks = ingress.value.cidr_blocks
      description = ingress.value.description
    }
  }
}
Don't overuse dynamic blocks. If you have one or two static blocks, write them out. Dynamic blocks make the code harder to read.
for_each vs count¶
count¶
resource "aws_instance" "web" {
  count         = 3
  ami           = var.ami_id
  instance_type = "t3.micro"
  tags = {
    Name = "web-${count.index}"
  }
}
# Creates: aws_instance.web[0], aws_instance.web[1], aws_instance.web[2]
for_each¶
resource "aws_instance" "web" {
  for_each      = toset(["app1", "app2", "app3"])
  ami           = var.ami_id
  instance_type = "t3.micro"
  tags = {
    Name = each.key
  }
}
# Creates: aws_instance.web["app1"], aws_instance.web["app2"], aws_instance.web["app3"]
The critical difference: With count, removing item at index 1 shifts all subsequent indexes, causing Terraform to destroy and recreate resources 2+ with new indexes. With for_each, removing "app2" only affects that one resource. Always prefer for_each for resources that may change independently.
Remember: count = array (index-based, fragile); for_each = map (key-based, stable). If you would not want a production database destroyed because you removed a different database from a list, use for_each.
for_each with maps¶
variable "instances" {
  type = map(object({
    instance_type = string
    ami           = string
  }))
  default = {
    web = { instance_type = "t3.micro", ami = "ami-abc123" }
    api = { instance_type = "t3.small", ami = "ami-def456" }
  }
}

resource "aws_instance" "this" {
  for_each      = var.instances
  ami           = each.value.ami
  instance_type = each.value.instance_type
  tags = {
    Name = each.key
  }
}
Custom Conditions (Terraform 1.2+)¶
Preconditions and postconditions add validation at plan or apply time:
resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = var.instance_type

  lifecycle {
    precondition {
      condition     = data.aws_ami.selected.architecture == "x86_64"
      error_message = "AMI must be x86_64 architecture."
    }
    postcondition {
      condition     = self.public_ip != ""
      error_message = "Instance must have a public IP assigned."
    }
  }
}
Preconditions check before creation. Postconditions check after. Use them to catch misconfigurations early rather than debugging resource failures.
Moved Blocks (Terraform 1.1+)¶
When you refactor code and rename resources:
# Tell Terraform the resource moved, don't destroy + recreate
moved {
from = aws_instance.web
to = aws_instance.application
}
# Move into a module
moved {
from = aws_instance.web
to = module.compute.aws_instance.web
}
Moved blocks are declarative -- they live in your code and tell Terraform about the rename. After applying, you can optionally remove them. This is safer than terraform state mv because it's reviewed in PRs and applied as part of normal workflow.
Import Blocks (Terraform 1.5+)¶
Declarative imports, replacing the imperative terraform import command:
import {
  to = aws_instance.web
  id = "i-0abc123def456789a"
}

resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"
}
Run terraform plan and Terraform shows what the import will look like. Adjust your resource config until the plan shows no changes after import. Then apply.
Generate config from an import (Terraform 1.5+):
This writes a .tf file with the resource configuration matching the imported state. Review it, clean it up, and integrate it into your codebase.
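Config generation is driven by plan's -generate-config-out flag, paired with an import block that has no matching resource block yet (the output filename here is illustrative):

```shell
# Requires an import block in config with no corresponding resource block
terraform plan -generate-config-out=generated.tf
```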
Testing Framework (Terraform 1.6+)¶
Native testing with .tftest.hcl files:
# tests/vpc.tftest.hcl
variables {
  vpc_cidr    = "10.0.0.0/16"
  environment = "test"
}

run "create_vpc" {
  command = apply

  assert {
    condition     = aws_vpc.main.cidr_block == "10.0.0.0/16"
    error_message = "VPC CIDR does not match."
  }

  assert {
    condition     = length(aws_subnet.private) == 3
    error_message = "Expected 3 private subnets."
  }
}

run "verify_tags" {
  command = plan

  assert {
    condition     = aws_vpc.main.tags["Environment"] == "test"
    error_message = "Environment tag not set correctly."
  }
}
# Run tests
terraform test
# Run specific test file
terraform test -filter=tests/vpc.tftest.hcl
# Verbose output
terraform test -verbose
Tests can use command = plan (no real resources created) or command = apply (creates real infrastructure, then destroys it after the test). Plan-only tests are faster and free. Apply tests catch issues that only appear at apply time.
Plan Files and Surgical Applies¶
# Save a plan to a file
terraform plan -out=plan.tfplan
# Apply only that exact plan (no drift between plan and apply)
terraform apply plan.tfplan
# Target a specific resource (surgical apply)
terraform plan -target=aws_instance.web
terraform apply -target=aws_instance.web
# Destroy a specific resource
terraform destroy -target=aws_instance.legacy
# Show plan in JSON (for programmatic analysis)
terraform show -json plan.tfplan
Plan files are binary and may contain secrets (they embed the full state at plan time). Don't store them in git or artifact stores without encryption.
-target is an escape hatch, not a workflow. If you're using it regularly, your state is too big or your dependencies are wrong. Split into smaller states.
Locals¶
Locals are computed values within a module. Use them to avoid repeating expressions:
locals {
  common_tags = {
    Environment = var.environment
    Project     = var.project_name
    ManagedBy   = "terraform"
    Team        = var.team
  }

  name_prefix = "${var.project_name}-${var.environment}"

  # Conditional logic
  is_prod = var.environment == "prod"

  # Transform data
  subnet_cidrs = [for i, az in var.availability_zones :
    cidrsubnet(var.vpc_cidr, 8, i)
  ]
}

resource "aws_instance" "web" {
  tags = merge(local.common_tags, {
    Name = "${local.name_prefix}-web"
  })
}
Provisioners (and Why to Avoid Them)¶
Provisioners run scripts on resources after creation:
resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  # Runs on the remote machine via SSH
  provisioner "remote-exec" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y nginx",
    ]

    connection {
      type        = "ssh"
      user        = "ubuntu"
      private_key = file("~/.ssh/id_rsa")
      host        = self.public_ip
    }
  }

  # Runs on the machine running Terraform
  provisioner "local-exec" {
    command = "echo ${self.public_ip} >> inventory.txt"
  }
}
Why to avoid them: Provisioners make resources not truly declarative. If the provisioner fails, the resource is tainted. If you change the provisioner script, Terraform doesn't re-run it (it only runs on creation). Use Packer to bake AMIs, cloud-init for bootstrap scripts, or Ansible for configuration management. The only legitimate use of local-exec is triggering external systems (like updating a DNS record via an API that has no Terraform provider).
Wiki Navigation¶
Prerequisites¶
- Terraform / IaC (Topic Pack, L1)
Related Content¶
- Runbook: Terraform Drift Detection Response (Runbook, L2) — Terraform, Terraform Deep Dive
- Case Study: SSH Timeout — MTU Mismatch, Fix Is Terraform Variable (Case Study, L2) — Terraform
- Case Study: Terraform Apply Fails — State Lock Stuck, DynamoDB Throttle (Case Study, L2) — Terraform
- Crossplane (Topic Pack, L2) — Terraform
- Deep Dive: Terraform State Internals (deep_dive, L2) — Terraform
- Mental Models (Core Concepts) (Topic Pack, L0) — Terraform
- OpenTofu & Terraform Ecosystem (Topic Pack, L2) — Terraform
- Pulumi (Topic Pack, L2) — Terraform
- Runbook: Cloud Capacity Limit Hit (Runbook, L2) — Terraform
- Runbook: Terraform State Lock Stuck (Runbook, L2) — Terraform
Pages that link here¶
- Anti-Primer: Terraform Deep Dive
- Certification Prep: HashiCorp Terraform Associate
- Comparison: Infrastructure as Code Tools
- Crossplane
- Opentofu
- Production Readiness Review: Answer Key
- Production Readiness Review: Study Plans
- Pulumi
- Runbook: Cloud Capacity Limit Hit
- Runbook: Terraform Drift Detection Response
- Runbook: Terraform State Lock Stuck
- Symptoms: Terraform Apply Fails, State Lock Stuck, Root Cause Is DynamoDB Throttle
- Terraform / Infrastructure as Code - Skill Check
- Terraform Deep Dive
- Terraform State Internals