Answer Key: The Replica That Fell Behind

The System

A PostgreSQL primary-replica pair deployed as a StatefulSet in Kubernetes on AWS. The primary (postgresql-0) handles all writes and is exposed via a ClusterIP service. The replica (postgresql-1) is intended for read-only queries via the postgresql-read service. The two instances are placed in separate availability zones for redundancy.

[Applications] --write--> [postgresql service (10.96.142.18)]
                               |
                          [postgresql-0] (primary, subnet A: 10.0.10.0/24, AZ us-east-1a)
                               |
                          WAL streaming (BLOCKED by NACL)
                               |
                          [postgresql-1] (replica, subnet B: 10.0.11.0/24, AZ us-east-1b)
                               |
[Applications] --read-->  [postgresql-read service (10.96.83.205)]
                               ^ Returns no healthy endpoints (pod not ready)

What's Broken

Root cause: The AWS Network ACL rule for the database subnets only allows inbound TCP/5432 from 10.0.10.0/24 (subnet A). The replica in subnet B (10.0.11.0/24) cannot connect to the primary on port 5432 because the NACL blocks the traffic. The replica has been stuck in startup state since deployment (14 days ago), unable to establish streaming replication.

Because the replica is not replicating, its readiness probe fails, the pod shows Ready: False, and the postgresql-read service has no healthy endpoints — so all read traffic is either failing or falling back to the primary.

Key clue: The Terraform NACL rule with cidr_block = "10.0.10.0/24" only covers one of the two database subnets. Cross-referencing with the PostgreSQL log showing "connection refused" from the replica to the primary confirms network-level blocking.
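The mismatch can be confirmed with plain subnet arithmetic. The following is a minimal sketch in pure shell; the replica address 10.0.11.23 is made up for illustration (any address in subnet B behaves the same):

```shell
# Convert a dotted-quad IP to a 32-bit integer.
ip_to_int() {
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# Does $1 (an IP) fall inside $2 (a CIDR block)?
in_cidr() {
  mask=$(( (0xFFFFFFFF << (32 - ${2#*/})) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "${2%/*}") & mask )) ]
}

# Hypothetical replica address vs. the NACL rule as written:
in_cidr 10.0.11.23 10.0.10.0/24 && echo allowed || echo blocked   # blocked
# Same address vs. the VPC-wide rule from the permanent fix:
in_cidr 10.0.11.23 10.0.0.0/16 && echo allowed || echo blocked    # allowed
```

Any source in 10.0.11.0/24 fails the /24 match but passes the /16 match, which is exactly the gap the fix below closes.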

The Fix

Immediate (restore replication)

Add a NACL rule for the second subnet:

aws ec2 create-network-acl-entry \
  --network-acl-id acl-XXXXX \
  --rule-number 101 \
  --protocol tcp \
  --port-range From=5432,To=5432 \
  --cidr-block 10.0.11.0/24 \
  --rule-action allow \
  --ingress

Then verify replication resumes:

kubectl exec -n databases postgresql-0 -- psql -U postgres \
  -c "SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;"

Permanent (fix in Terraform)

Add the missing NACL rule:

resource "aws_network_acl_rule" "database_inbound_b" {
  network_acl_id = aws_network_acl.database.id
  rule_number    = 101
  egress         = false
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "10.0.11.0/24"
  from_port      = 5432
  to_port        = 5432
}

Or, better, replace it with a single rule whose CIDR covers all database subnets (NACL rule numbers are unique per direction, so reusing number 100 means removing the old 10.0.10.0/24 rule first):

resource "aws_network_acl_rule" "database_inbound" {
  network_acl_id = aws_network_acl.database.id
  rule_number    = 100
  egress         = false
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "10.0.0.0/16"   # Allow within VPC
  from_port      = 5432
  to_port        = 5432
}

If the replica has fallen too far behind (that is, the primary has already recycled WAL segments the standby still needs), it will require a fresh base backup:

kubectl delete pod postgresql-1 -n databases
# The StatefulSet controller recreates the pod, but note that its PVC (and the
# stale data directory on it) survives pod deletion. Depending on the image's
# init logic, a true re-bootstrap from the primary may also require deleting
# the replica's PVC so the data directory is rebuilt from a base backup.

Verification

# Check replica readiness
kubectl get pods -n databases -l app.kubernetes.io/name=postgresql

# Check replication state
kubectl exec -n databases postgresql-0 -- psql -U postgres \
  -c "SELECT state, sent_lsn, replay_lsn, replay_lag FROM pg_stat_replication;"

# Check the replication lag metric is decreasing (run from inside the cluster;
# the short hostname assumes pod DNS resolves via the headless service)
curl -s http://postgresql-1:9187/metrics | grep pg_replication_lag

# Check read service has endpoints
kubectl get endpoints postgresql-read -n databases

Artifact Decoder

CLI Output
  What it revealed: replica Running but not Ready — replication is broken, not the container itself.
  What was misleading: STATUS: Running makes the pod look healthy at first glance.

Metrics
  What it revealed: 9.8 days of lag and a "startup" state (never "streaming"): replication was never established.
  What was misleading: pg_up shows both instances are "up" — technically true, but the exporter connects locally.

IaC Snippet
  What it revealed: the NACL rule only allows 10.0.10.0/24, missing the second subnet.
  What was misleading: two subnets being defined makes it look like HA is properly configured.

Log Lines
  What it revealed: "connection refused" from replica to primary confirms network-level blocking.
  What was misleading: the checkpoint log from postgresql-0 is normal PostgreSQL housekeeping, not diagnostic.

Skills Demonstrated

  • Distinguishing pod Running from pod Ready in Kubernetes
  • Interpreting PostgreSQL replication metrics and states
  • Reading AWS Network ACL rules and identifying missing entries
  • Cross-referencing infrastructure code with application-layer errors
  • Understanding StatefulSet networking with headless services

Prerequisite Topic Packs