Ops Archaeology: The Replica That Fell Behind

You've just joined a team. There are no docs. The previous engineer left last month. Something is broken. Here's everything you have to work with.

Difficulty: L1 | Estimated time: 15 min | Domains: Kubernetes, PostgreSQL, Terraform, Networking

Artifact 1: CLI Output

$ kubectl get pods -n databases -l app.kubernetes.io/name=postgresql
NAME              READY   STATUS    RESTARTS   AGE
postgresql-0      1/1     Running   0          14d
postgresql-1      0/1     Running   0          14d

$ kubectl describe pod postgresql-1 -n databases | grep -A5 "Conditions:"
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True

$ kubectl get svc -n databases -l app.kubernetes.io/name=postgresql
NAME                  TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
postgresql            ClusterIP   10.96.142.18   <none>        5432/TCP   14d
postgresql-headless   ClusterIP   None           <none>        5432/TCP   14d
postgresql-read       ClusterIP   10.96.83.205   <none>        5432/TCP   14d

Artifact 2: Metrics

# Prometheus query: pg_replication_lag_seconds
pg_replication_lag_seconds{instance="postgresql-1:9187",slot_name="postgresql_1"} 847293

# That's ~9.8 days of replication lag

# Prometheus query: pg_up
pg_up{instance="postgresql-0:9187"} 1
pg_up{instance="postgresql-1:9187"} 1

# Prometheus query: pg_stat_replication_state
pg_stat_replication_state{application_name="postgresql-1",state="streaming"} 0
pg_stat_replication_state{application_name="postgresql-1",state="startup"} 1
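
The lag sample in this artifact can be sanity-checked with a little arithmetic. The following is a minimal sketch that converts the raw `pg_replication_lag_seconds` value into days, hours, minutes, and seconds; the function name `lag_breakdown` is hypothetical and not part of any exporter or tool shown here.

```python
SECONDS_PER_DAY = 86_400


def lag_breakdown(seconds: int) -> tuple[int, int, int, int]:
    """Split a lag value in seconds into (days, hours, minutes, seconds)."""
    days, rem = divmod(seconds, SECONDS_PER_DAY)
    hours, rem = divmod(rem, 3_600)
    minutes, secs = divmod(rem, 60)
    return days, hours, minutes, secs


if __name__ == "__main__":
    lag = 847_293  # value reported for postgresql-1 in the metrics above
    d, h, m, s = lag_breakdown(lag)
    print(f"{lag / SECONDS_PER_DAY:.1f} days ({d}d {h}h {m}m {s}s)")
    # prints: 9.8 days (9d 19h 21m 33s)
```

Comparing this duration against the pod ages in Artifact 1 and the timestamps in Artifact 4 helps place roughly when replication stopped.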

Artifact 3: Infrastructure Code

# From: terraform/modules/vpc/subnets.tf
resource "aws_subnet" "database_a" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.10.0/24"
  availability_zone = "us-east-1a"

  tags = {
    Name = "db-subnet-a"
    Tier = "database"
  }
}

resource "aws_subnet" "database_b" {
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.11.0/24"
  availability_zone = "us-east-1b"

  tags = {
    Name = "db-subnet-b"
    Tier = "database"
  }
}

resource "aws_network_acl_rule" "database_inbound" {
  network_acl_id = aws_network_acl.database.id
  rule_number    = 100
  egress         = false
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "10.0.10.0/24"
  from_port      = 5432
  to_port        = 5432
}

Artifact 4: Log Lines

[2024-08-20T03:14:22Z] postgresql-1 | FATAL:  could not connect to the primary server: connection refused
[2024-08-20T03:14:22Z] postgresql-1 | Is the server running on host "postgresql-0.postgresql-headless.databases.svc" and accepting TCP/IP connections on port 5432?
[2024-08-18T11:30:05Z] postgresql-0 | LOG:  checkpoint complete: wrote 14832 buffers (11.3%); 0 WAL file(s) added

Your Mission

Reconstruct: What does this system do? What are its components and purpose?
Diagnose: What is currently broken or degraded, and why?
Propose: What would you do to fix it? What would you check first?