
Decision Tree: Latency Has Increased

Category: Incident Triage
Starting Question: "Response latency has spiked — where is it?"
Estimated traversal: 3-5 minutes
Domains: kubernetes, observability, linux-performance, postgresql, redis, networking


The Tree

Response latency has spiked — where is it?
│
├── Is it all endpoints or only specific ones?
│   │
│   ├── Specific endpoints only
│   │   │
│   │   ├── Are those endpoints backed by a specific downstream service?
│   │   │   └── Yes → check that downstream's latency independently
│   │   │       `kubectl exec -it <pod> -- curl -w "%{time_total}" http://<downstream>/healthz`
│   │   │       └── Downstream is slow → follow downstream branch below
│   │   │
│   │   └── Do those endpoints involve heavy queries or compute?
│   │       `kubectl logs <pod> | grep -E "slow|duration|took [0-9]+ms"`
│   │       └── Yes → ✅ ACTION: Optimize Query or Add Cache
│   │
│   └── All endpoints (service-wide latency increase)
│       │
│       ├── Is the spike correlated with a recent deployment?
│       │   `kubectl rollout history deployment/<name>`
│       │   │
│       │   ├── Yes, deployed in last 60 min
│       │   │   └── Did the new version introduce synchronous calls or remove caching?
│       │   │       └── Likely → ✅ ACTION: Roll Back Deployment
│       │   │
│       │   └── No recent deployment → continue below
│       │
│       ├── Check percentile breakdown: p50 vs p95 vs p99
│       │   (Prometheus: `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))`)
│       │   │
│       │   ├── p50 fine, p99 very high (long tail) → outlier requests / GC pauses / lock contention
│       │   │   │
│       │   │   ├── JVM / Go service? → check GC metrics
│       │   │   │   `kubectl exec -it <pod> -- curl localhost:8080/metrics | grep gc_`
│       │   │   │   └── Long GC pauses → ✅ ACTION: Tune GC / Increase Memory
│       │   │   │
│       │   │   └── Check for DB lock waits
│       │   │       `SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL;`
│       │   │       └── Locks present → ✅ ACTION: Investigate DB Lock Contention
│       │   │
│       │   └── All percentiles high (uniform latency increase)
│       │       │
│       │       ├── Check app CPU/memory
│       │       │   `kubectl top pods -l app=<svc>`
│       │       │   │
│       │       │   ├── CPU > 80% of limit → CPU throttling causing latency
│       │       │   │   └── ✅ ACTION: Scale Pods or Increase CPU Limit
│       │       │   │
│       │       │   └── CPU/memory fine → bottleneck is downstream
│       │       │       → continue to downstream checks
│       │       │
│       │       ├── Check database latency
│       │       │   `SELECT pid, now()-query_start AS duration, query FROM pg_stat_activity
│       │       │    WHERE state='active' ORDER BY duration DESC LIMIT 10;`
│       │       │   ├── Long-running queries → ✅ ACTION: Identify and Kill Slow Query / Add Index
│       │       │   └── DB connection pool exhausted?
│       │       │       App logs: look for "pool timeout" / "connection refused"
│       │       │       └── Yes → ✅ ACTION: Tune Connection Pool / Scale DB
│       │       │
│       │       ├── Check cache hit rate
│       │       │   `redis-cli -h $REDIS_HOST info stats | grep keyspace_hits`
│       │       │   `redis-cli -h $REDIS_HOST info stats | grep keyspace_misses`
│       │       │   │
│       │       │   ├── Hit rate dropped sharply → cache eviction or restart cleared cache
│       │       │   │   └── ✅ ACTION: Investigate Cache Eviction / Warm Cache
│       │       │   └── Hit rate fine → not a cache issue
│       │       │
│       │       └── Check network latency
│       │           ├── DNS resolution time?
│       │           │   `kubectl exec -it <pod> -- time nslookup <dependency>`
│       │           │   └── DNS slow (>10ms) → ✅ ACTION: Fix DNS / Check CoreDNS
│       │           └── Packet loss or high RTT?
│       │               `kubectl exec -it <pod> -- ping -c 20 <dependency-ip>`
│       │               └── Loss > 1% or RTT > 5ms above baseline → ✅ ACTION: Investigate Network Path
│       │                   (check cloud provider console for VPC issues)

Node Details

Check 1: Endpoint-level vs service-wide

Command: Look at your APM, or:
`kubectl exec -it <pod> -- curl -w "time_connect: %{time_connect}\ntime_starttransfer: %{time_starttransfer}\ntime_total: %{time_total}\n" -o /dev/null -s http://<endpoint>`
What you're looking for: Whether slow endpoints share a common dependency (same DB table, same microservice call, same external API).
Common pitfall: A single slow endpoint with a missing index can cause connection pool exhaustion that then makes ALL endpoints slow. Treat "all slow" as a symptom, not a root cause.

Check 2: Deployment correlation

Command: `kubectl rollout history deployment/<name> --revision=<N>` to see what changed. Also: `git log --since="2 hours ago" --oneline` in the service repo.
What you're looking for: Timestamp of the last rollout compared to onset time in your latency graph. Also check for ConfigMap or Secret changes: `kubectl describe deployment | grep -A3 "Last Applied"`.
Common pitfall: A Helm values change that added a synchronous external call (e.g., audit logging on every request) will not show in rollout history — look at the values diff.

Check 3: p50/p95/p99 percentile breakdown

Command: In Prometheus/Grafana: `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="<svc>"}[5m])) by (le))`. Compare with p50 on the same panel.
What you're looking for: A rising p99 with a flat p50 = "long tail" problem (GC, lock contention, outlier slow requests). All percentiles rising = systemic slowdown (CPU, DB, network).
Common pitfall: histogram_quantile requires well-configured bucket boundaries. If all requests land in the last bucket, the quantile is inaccurate — check for bucket overflow.
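To test the bucket-overflow pitfall, compare the traffic that fits under the largest finite bucket boundary against total traffic. A sketch — the `le="10"` boundary is an assumption, substitute your histogram's actual top finite bucket:

```promql
# Fraction of requests slower than the largest finite bucket boundary
# (le="10" is assumed here). A value near 1 means most requests overflow
# into +Inf and histogram_quantile results are unreliable.
1 - (
  sum(rate(http_request_duration_seconds_bucket{job="<svc>", le="10"}[5m]))
  /
  sum(rate(http_request_duration_seconds_bucket{job="<svc>", le="+Inf"}[5m]))
)
```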

Check 4: CPU throttling

Command: `kubectl top pods -l app=<svc> -n <namespace>` and in Prometheus: `rate(container_cpu_cfs_throttled_seconds_total{pod=~"<pod-prefix>.*"}[5m])`. Also: `kubectl describe pod <pod> | grep -A4 "Limits"`.
What you're looking for: `container_cpu_cfs_throttled_seconds_total` rising is the definitive signal for CPU throttling, even when `kubectl top` shows usage below the limit.
Common pitfall: `kubectl top` shows CPU usage, not CPU throttling. A pod using 200m with a 250m limit may be throttled 30% of the time at burst moments, causing latency spikes invisible to `top`.
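A complementary view is the fraction of CFS scheduling periods in which the container was throttled, which normalizes the counter above. A sketch, assuming the standard cAdvisor period counters are scraped:

```promql
# Share of CFS periods in which the container hit its quota and was throttled.
# Sustained values above roughly 0.2 tend to show up as request latency.
rate(container_cpu_cfs_throttled_periods_total{pod=~"<pod-prefix>.*"}[5m])
/
rate(container_cpu_cfs_periods_total{pod=~"<pod-prefix>.*"}[5m])
```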

Check 5: Database slow queries

Command: `kubectl exec -it <db-pod> -- psql -U postgres -c "SELECT pid, now()-query_start AS duration, left(query,80) FROM pg_stat_activity WHERE state='active' AND query_start < now()-interval '5 seconds' ORDER BY duration DESC;"`
What you're looking for: Queries running longer than expected. Any query taking >100ms on OLTP is worth examining.
Common pitfall: pg_stat_activity shows only currently-running queries. To catch queries that complete quickly but run frequently, enable pg_stat_statements and check total_exec_time / calls.
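The pg_stat_statements check mentioned in the pitfall can look like this — a sketch, using the column names from PostgreSQL 13+ (where cumulative time is `total_exec_time`):

```sql
-- Top queries by cumulative execution time, which surfaces fast-but-frequent
-- queries that pg_stat_activity never shows. Requires the pg_stat_statements
-- extension to be installed and loaded.
SELECT left(query, 60)                              AS query,
       calls,
       round((total_exec_time / calls)::numeric, 2) AS mean_ms,
       round(total_exec_time::numeric, 0)           AS total_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```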

Check 6: Cache hit rate

Command: `redis-cli -h $REDIS_HOST -a $REDIS_PASSWORD info stats | grep -E "keyspace_hits|keyspace_misses|evicted_keys"`
What you're looking for: evicted_keys increasing rapidly means Redis is evicting data under memory pressure, causing more cache misses and DB load. Hit rate = hits / (hits + misses).
Common pitfall: A Redis restart or FLUSHALL (e.g., after a credential rotation) will zero out the cache. The first 5-10 minutes after restart will show 100% miss rate — this is expected and should self-correct.
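The hit-rate arithmetic above can be scripted. A minimal sketch — the sample counter values stand in for live `redis-cli -h $REDIS_HOST info stats` output:

```shell
# Compute the Redis cache hit rate from `info stats` counters.
# The two sample values below are placeholders for live redis-cli output.
stats='keyspace_hits:95000
keyspace_misses:5000'

echo "$stats" | awk -F: '
  /keyspace_hits/   { hits = $2 }
  /keyspace_misses/ { misses = $2 }
  END { printf "hit rate: %.1f%%\n", 100 * hits / (hits + misses) }'
```

With the sample counters this prints `hit rate: 95.0%`; anything well below your normal baseline points at eviction or a cold cache.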

Check 7: DNS latency

Command: `kubectl exec -it <pod> -- bash -c 'for i in $(seq 1 10); do time nslookup <dependency-service> 2>&1 | tail -3; done'`
What you're looking for: Any lookup taking >5ms is suspicious. Consistent >10ms is a problem.
Common pitfall: DNS issues often manifest as intermittent latency rather than consistent failure. A single slow lookup can make p99 spike while p50 looks fine. Check CoreDNS pod health: `kubectl get pods -n kube-system -l k8s-app=kube-dns`.


Terminal Actions

Action: Roll Back Deployment

Do:
1. `kubectl rollout undo deployment/<name>`
2. `kubectl rollout status deployment/<name>` — wait for completion
3. Verify latency drops in metrics within 2-3 minutes
Verify: p99 latency returns to baseline. `kubectl get pods -l app=<name>` shows all pods Ready.
Runbook: helm_upgrade_failed.md

Action: Scale Pods or Increase CPU Limit

Do:
1. Check if an HPA exists: `kubectl get hpa`
2. Manual scale: `kubectl scale deployment <name> --replicas=<N+2>`
3. If throttling is the root cause (not just load), increase the CPU limit: `kubectl set resources deployment <name> --limits=cpu=1000m`
4. Long-term: raise or remove the CPU limit (setting request = limit gives Guaranteed QoS, but bursts above the limit are still throttled)
Verify: `kubectl top pods` shows CPU below 70%. Latency p99 drops.
Runbook: hpa_not_scaling.md

Action: Identify and Kill Slow Query / Add Index

Do:
1. Identify the slow query from pg_stat_activity or pg_stat_statements
2. Kill it if it is blocking others: `SELECT pg_terminate_backend(<pid>);`
3. Run `EXPLAIN (ANALYZE, BUFFERS) <slow query>;` to identify a missing index
4. Add the index in an off-peak window: `CREATE INDEX CONCURRENTLY idx_name ON table(col);`
Verify: pg_stat_activity no longer shows long-running queries. App latency drops.

Action: Tune Connection Pool / Scale DB

Do:
1. Check the current pool config in the app: env vars like DB_POOL_SIZE, DB_MAX_CONNECTIONS
2. Check Postgres max_connections: `SHOW max_connections;`
3. Reduce pool size if oversubscribed (pool_size * pod_count should not exceed max_connections * 0.8)
4. Consider PgBouncer if not already deployed
Verify: No more "pool timeout" errors in app logs. Query queue depth drops.
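Step 3's oversubscription rule is simple arithmetic worth scripting. A sketch — all three input values are illustrative placeholders for your pool size, replica count, and `SHOW max_connections;` result:

```shell
# Oversubscription check: total app-side connections vs 80% of Postgres capacity.
# All three values below are sample placeholders - substitute your own.
POOL_SIZE=20         # per-pod connection pool size (e.g. DB_POOL_SIZE)
POD_COUNT=6          # current replica count of the deployment
MAX_CONNECTIONS=100  # from `SHOW max_connections;`

total=$((POOL_SIZE * POD_COUNT))
budget=$((MAX_CONNECTIONS * 80 / 100))

if [ "$total" -gt "$budget" ]; then
  echo "oversubscribed: $total app connections vs budget of $budget"
else
  echo "ok: $total app connections within budget of $budget"
fi
```

With the sample values (120 connections against a budget of 80) this reports oversubscription — the situation where pool timeouts start appearing under load.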

Action: Investigate Cache Eviction / Warm Cache

Do:
1. Check Redis memory: `redis-cli info memory | grep used_memory_human`
2. Check the eviction policy: `redis-cli config get maxmemory-policy`
3. If evicting: increase the Redis memory limit or switch to the allkeys-lru policy
4. Pre-warm the cache by replaying recent reads against the DB
Verify: keyspace_misses rate drops. App latency returns to baseline.

Action: Fix DNS / Check CoreDNS

Do:
1. Check CoreDNS pods: `kubectl get pods -n kube-system -l k8s-app=kube-dns`
2. Check CoreDNS logs: `kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100`
3. Check the CoreDNS ConfigMap: `kubectl get cm -n kube-system coredns -o yaml`
4. If CoreDNS is overloaded, scale it: `kubectl scale deployment coredns -n kube-system --replicas=3`
5. Set ndots:2 in the pod spec to reduce unnecessary DNS lookup attempts
Verify: nslookup from within pods returns in <5ms.
Runbook: dns_resolution.md
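Step 5's ndots change belongs in the pod spec's `dnsConfig`. A sketch of the relevant fragment:

```yaml
# Pod-spec fragment: with ndots:2, names containing two or more dots are
# tried as absolute first, skipping search-domain expansion round-trips.
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
```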

Action: Tune GC / Increase Memory

Do:
1. For JVM: set `JAVA_OPTS="-XX:+UseG1GC -Xms512m -Xmx2g -XX:MaxGCPauseMillis=200"`
2. For Go: inspect GC pause stats via `runtime.ReadMemStats` exposed through pprof
3. Increase the pod memory limit if the heap is constrained: `kubectl set resources deployment <name> --limits=memory=2Gi`
4. Confirm the memory limit increase doesn't trigger OOM: `watch kubectl get pods` for OOMKilled status
Verify: GC pause metrics drop. p99 long-tail latency reduces.

Action: Optimize Query or Add Cache

Do:
1. Use EXPLAIN ANALYZE output to identify the slow plan
2. Add an index if there is a seq scan on a large table: `CREATE INDEX CONCURRENTLY ...`
3. Add an application-layer cache for hot, repeated reads
4. If the query is unavoidable and slow, move it to async processing where possible
Verify: Endpoint-specific latency drops. DB CPU decreases.

Action: Investigate Network Path

Do:
1. Check for MTU mismatches: `kubectl exec -it <pod> -- ping -M do -s 1450 <ip>`
2. Check for packet loss: `kubectl exec -it <pod> -- traceroute -n <ip>`
3. Check cloud provider VPC flow logs for drops
4. Check if a network policy is causing TCP resets: `kubectl get networkpolicy -n <namespace>`
Verify: Round-trip latency returns to <1ms within the cluster.
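The loss percentage from ping's summary line can be extracted and compared against the 1% threshold used in the tree. A sketch against a sample summary line — substitute real `ping -c 20` output:

```shell
# Extract the packet-loss percentage from ping's summary line and compare it
# to the 1% threshold. The summary string below is a sample placeholder.
ping_summary='20 packets transmitted, 19 received, 5% packet loss, time 19028ms'

loss=$(echo "$ping_summary" | sed 's/.* \([0-9.]*\)% packet loss.*/\1/')

if awk "BEGIN { exit !($loss > 1) }"; then
  echo "packet loss ${loss}% exceeds 1% - investigate network path"
else
  echo "packet loss ${loss}% within tolerance"
fi
```

The awk comparison handles fractional percentages (e.g. "0.5%") that shell integer arithmetic would mishandle.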

Escalation: Network / Infrastructure Investigation

When: Latency is high across multiple services, DNS and DB are healthy, and there are no recent deployments.
Who: Infrastructure / Platform team, cloud provider support
Include in page: Affected services, onset time, traceroute outputs, and whether latency correlates with node identity (some nodes affected, others not)


Edge Cases

  • Latency spike only during business hours: May indicate legitimate load growth (horizontal scaling needed) rather than a bug or misconfiguration. Check traffic volume alongside latency.
  • Latency spike on leader pod only: StatefulSet with leader election — if the leader pod is resource-constrained, all writes slow down. Check kubectl get pods and identify which pod holds the lease.
  • Latency drops when a specific pod is deleted: That pod may have a corrupted connection pool or local state. Delete it after confirming others can handle load.
  • Service mesh (Istio/Linkerd): Sidecar proxy overhead adds ~0.5-1ms per hop. After mesh installation, small latency increases are expected. Check istioctl proxy-status for stale config causing extra retries.
  • Latency spike during node drain: Pod rescheduling causes connection churn. Ensure terminationGracePeriodSeconds is sufficient for in-flight requests to complete.

Cross-References