Answer Key: The Replica That Fell Behind
The System
A PostgreSQL primary-replica pair is deployed as a StatefulSet in Kubernetes on AWS. The primary (postgresql-0) handles all writes and is exposed via a ClusterIP service. The replica (postgresql-1) is intended to serve read-only queries via the postgresql-read service. The two instances run in separate availability zones for redundancy.
[Applications] --write--> [postgresql service (10.96.142.18)]
                                  |
                          [postgresql-0] (primary, subnet A: 10.0.10.0/24, AZ us-east-1a)
                                  |
                          WAL streaming (BLOCKED by NACL)
                                  |
                          [postgresql-1] (replica, subnet B: 10.0.11.0/24, AZ us-east-1b)
                                  |
[Applications] --read-->  [postgresql-read service (10.96.83.205)]
                          ^ Returns no healthy endpoints (pod not ready)
What's Broken
Root cause: The AWS Network ACL rule for the database subnets only allows inbound TCP/5432 from 10.0.10.0/24 (subnet A). The replica in subnet B (10.0.11.0/24) cannot connect to the primary on port 5432 because the NACL blocks the traffic. The replica has been stuck in startup state since deployment (14 days ago), unable to establish streaming replication.
Because the replica is not replicating, its readiness probe fails, the pod shows Ready: False, and the postgresql-read service has no healthy endpoints — so all read traffic is either failing or falling back to the primary.
Key clue: The Terraform NACL rule with cidr_block = "10.0.10.0/24" only covers one of the two database subnets. Cross-referencing with the PostgreSQL log showing "connection refused" from the replica to the primary confirms network-level blocking.
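The mismatch can also be checked mechanically. The following is a minimal bash sketch, not part of the original scenario: it tests whether an address falls inside the CIDR that the NACL rule allows. The helper names `ip_to_int` and `ip_in_cidr` and the two pod IPs are hypothetical, chosen only to represent one host in each database subnet.

```shell
#!/usr/bin/env bash

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 if IPv4 address $1 falls inside CIDR block $2, non-zero otherwise.
ip_in_cidr() {
  local ip=$1 net=${2%/*} bits=${2#*/}
  local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( ( $(ip_to_int "$ip") & mask ) == ( $(ip_to_int "$net") & mask ) ))
}

# Hypothetical pod IPs, one per database subnet, checked against the
# only CIDR the NACL rule allows on TCP/5432.
for ip in 10.0.10.25 10.0.11.25; do
  if ip_in_cidr "$ip" 10.0.10.0/24; then
    echo "$ip allowed"
  else
    echo "$ip blocked"
  fi
done
```

Under these assumptions the loop prints `10.0.10.25 allowed` and `10.0.11.25 blocked`, which matches the replica's position in subnet B.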
The Fix
Immediate (restore replication)
Add an inbound NACL rule for the second subnet (NACLs are stateless, so also confirm that the outbound rules permit the return traffic):
aws ec2 create-network-acl-entry \
  --network-acl-id acl-XXXXX \
  --rule-number 101 \
  --protocol tcp \
  --port-range From=5432,To=5432 \
  --cidr-block 10.0.11.0/24 \
  --rule-action allow \
  --ingress
Then verify replication resumes:
kubectl exec -n databases postgresql-0 -- psql -U postgres \
  -c "SELECT client_addr, state, sent_lsn, replay_lsn FROM pg_stat_replication;"
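A rough way to read the `sent_lsn` and `replay_lsn` columns from that query: an LSN is a 64-bit WAL byte position printed as two hexadecimal halves, so the replay gap is a simple subtraction. The bash sketch below assumes that format; `lsn_to_bytes` and `replay_gap_bytes` are hypothetical helper names, and the sample LSNs are made up for illustration. Inside PostgreSQL, the built-in `pg_wal_lsn_diff` performs the same arithmetic.

```shell
#!/usr/bin/env bash

# Convert a PostgreSQL LSN such as "16/B374D848" to an absolute byte position.
# The part before the slash is the high 32 bits, the part after is the low 32 bits.
lsn_to_bytes() {
  local hi=${1%/*} lo=${1#*/}
  echo $(( (16#$hi << 32) + 16#$lo ))
}

# Bytes of WAL the primary has sent ($1) that the replica has not yet replayed ($2).
replay_gap_bytes() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Made-up sample values standing in for sent_lsn and replay_lsn.
replay_gap_bytes "0/3000148" "0/3000060"
```

A gap that shrinks toward zero over successive runs suggests the replica is catching up; a gap that keeps growing suggests it is not.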
Permanent (fix in Terraform)
Add the missing NACL rule:
resource "aws_network_acl_rule" "database_inbound_b" {
  network_acl_id = aws_network_acl.database.id
  rule_number    = 101
  egress         = false
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "10.0.11.0/24"
  from_port      = 5432
  to_port        = 5432
}
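Alternatively, widen the rule to a single CIDR that covers all database subnets. The example uses 10.0.0.0/16, which admits the entire VPC; 10.0.10.0/23 is the tightest block that covers exactly 10.0.10.0/24 and 10.0.11.0/24: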
resource "aws_network_acl_rule" "database_inbound" {
  network_acl_id = aws_network_acl.database.id
  rule_number    = 100
  egress         = false
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "10.0.0.0/16" # Allow within VPC
  from_port      = 5432
  to_port        = 5432
}
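If the replica cannot catch up once connectivity is restored (for example, because the primary no longer retains the WAL it needs), it may need a fresh base backup:
kubectl delete pod postgresql-1 -n databases
# The StatefulSet controller recreates the pod, but its PersistentVolumeClaim is retained.
# A re-bootstrap from the primary happens only if the image's init logic takes a new base backup.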
Verification
# Check replica readiness
kubectl get pods -n databases -l app.kubernetes.io/name=postgresql
# Check replication state
kubectl exec -n databases postgresql-0 -- psql -U postgres \
  -c "SELECT state, sent_lsn, replay_lsn, replay_lag FROM pg_stat_replication;"
# Check replication lag metric is decreasing
curl -s http://postgresql-1:9187/metrics | grep pg_replication_lag
# Check read service has endpoints
kubectl get endpoints postgresql-read -n databases
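Since a `Running` status can hide a failing readiness probe, it can help to filter the pod listing on the READY column instead of scanning it by eye. The sketch below is one way to do that with awk; `not_ready_pods` is a hypothetical helper, and the sample listing is illustrative rather than captured output.

```shell
#!/usr/bin/env bash

# Read `kubectl get pods` output on stdin and print the name of every pod
# whose READY column (ready/total containers) is not fully satisfied.
not_ready_pods() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Illustrative listing (not captured output): STATUS is Running for both pods.
not_ready_pods <<'EOF'
NAME           READY   STATUS    RESTARTS   AGE
postgresql-0   1/1     Running   0          14d
postgresql-1   0/1     Running   0          14d
EOF
```

Against a live cluster the same filter could be fed from `kubectl get pods -n databases -l app.kubernetes.io/name=postgresql | not_ready_pods`; empty output would mean every listed pod is ready.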
Artifact Decoder
| Artifact | What It Revealed | What Was Misleading |
|---|---|---|
| CLI Output | Replica running but not ready — replication is broken, not the container itself | STATUS: Running makes it look healthy at first glance |
| Metrics | 9.8 days of lag and startup state (not streaming) = replication never established | pg_up shows both instances are "up", which is technically true but only because the exporter connects locally |
| IaC Snippet | NACL rule only allows 10.0.10.0/24 — missing the second subnet | Two subnets defined makes it look like HA is properly configured |
| Log Lines | "connection refused" from replica to primary confirms network-level blocking | The checkpoint log from postgresql-0 is normal PostgreSQL housekeeping, not diagnostic |
Skills Demonstrated
- Distinguishing pod `Running` from pod `Ready` in Kubernetes
- Interpreting PostgreSQL replication metrics and states
- Reading AWS Network ACL rules and identifying missing entries
- Cross-referencing infrastructure code with application-layer errors
- Understanding StatefulSet networking with headless services