Ops Archaeology: The Replica That Fell Behind
You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.
Difficulty: L1 | Estimated time: 15 min | Domains: Kubernetes, PostgreSQL, Terraform, Networking
Artifact 1: CLI Output
$ kubectl get pods -n databases -l app.kubernetes.io/name=postgresql
NAME           READY   STATUS    RESTARTS   AGE
postgresql-0   1/1     Running   0          14d
postgresql-1   0/1     Running   0          14d

$ kubectl describe pod postgresql-1 -n databases | grep -A5 "Conditions:"
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True

$ kubectl get svc -n databases -l app.kubernetes.io/name=postgresql
NAME                  TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
postgresql            ClusterIP   10.96.142.18   <none>        5432/TCP   14d
postgresql-headless   ClusterIP   None           <none>        5432/TCP   14d
postgresql-read       ClusterIP   10.96.83.205   <none>        5432/TCP   14d
Artifact 2: Metrics
# Prometheus query: pg_replication_lag_seconds
pg_replication_lag_seconds{instance="postgresql-1:9187",slot_name="postgresql_1"} 847293
# That's ~9.8 days of replication lag
# Prometheus query: pg_up
pg_up{instance="postgresql-0:9187"} 1
pg_up{instance="postgresql-1:9187"} 1
# Prometheus query: pg_stat_replication_state
pg_stat_replication_state{application_name="postgresql-1",state="streaming"} 0
pg_stat_replication_state{application_name="postgresql-1",state="startup"} 1
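The "~9.8 days" figure in the metric comment can be checked directly. A minimal sketch in Python, using only the sample value shown above; the variable names are illustrative and not part of the original system:

```python
from datetime import timedelta

# Value copied from the pg_replication_lag_seconds sample above.
lag_seconds = 847_293

# timedelta normalizes the raw seconds into days, hours, minutes, seconds.
lag = timedelta(seconds=lag_seconds)

# Dividing two timedeltas yields a float, here the lag expressed in days.
lag_days = lag / timedelta(days=1)

print(f"replication lag: {lag} (about {lag_days:.1f} days)")
```

Comparing this figure with the 14d pod age in Artifact 1 gives a rough sense of when replication stopped advancing relative to when the pods were created.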
Artifact 3: Infrastructure Code
# From: terraform/modules/vpc/subnets.tf
resource "aws_subnet" "database_a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.10.0/24"
  availability_zone = "us-east-1a"

  tags = {
    Name = "db-subnet-a"
    Tier = "database"
  }
}

resource "aws_subnet" "database_b" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.11.0/24"
  availability_zone = "us-east-1b"

  tags = {
    Name = "db-subnet-b"
    Tier = "database"
  }
}

resource "aws_network_acl_rule" "database_inbound" {
  network_acl_id = aws_network_acl.database.id
  rule_number    = 100
  egress         = false
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "10.0.10.0/24"
  from_port      = 5432
  to_port        = 5432
}
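One way to read the ACL rule above is to ask which of the two database subnets its cidr_block matches as a traffic source. The sketch below does that with Python's standard ipaddress module. It assumes the rule shown is the only relevant inbound rule, which the artifact does not confirm, and the dictionary keys simply reuse the Terraform resource names for illustration:

```python
import ipaddress

# CIDR blocks copied from the aws_subnet resources above.
subnets = {
    "database_a": ipaddress.ip_network("10.0.10.0/24"),
    "database_b": ipaddress.ip_network("10.0.11.0/24"),
}

# Source range of the database_inbound rule (rule_number 100, tcp/5432).
# Assumption: no other inbound rules exist; the snippet shows only this one.
rule_source = ipaddress.ip_network("10.0.10.0/24")

# A subnet's traffic matches the rule only if the subnet lies inside the
# rule's source range.
coverage = {name: net.subnet_of(rule_source) for name, net in subnets.items()}

for name, matched in coverage.items():
    print(f"{name}: matched by rule 100 = {matched}")
```

This says nothing about where the two pods are actually scheduled; that would need to be confirmed separately, for example from the node and pod IPs.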
Artifact 4: Log Lines
[2024-08-20T03:14:22Z] postgresql-1 | FATAL: could not connect to the primary server: connection refused
[2024-08-20T03:14:22Z] postgresql-1 | Is the server running on host "postgresql-0.postgresql-headless.databases.svc" and accepting TCP/IP connections on port 5432?
[2024-08-18T11:30:05Z] postgresql-0 | LOG: checkpoint complete: wrote 14832 buffers (11.3%); 0 WAL file(s) added
Your Mission
- Reconstruct: What does this system do? What are its components and purpose?
- Diagnose: What is currently broken or degraded, and why?
- Propose: What would you do to fix it? What would you check first?