Decision Tree: Latency Has Increased¶
Category: Incident Triage
Starting Question: "Response latency has spiked — where is it?"
Estimated traversal: 3-5 minutes
Domains: kubernetes, observability, linux-performance, postgresql, redis, networking
The Tree¶
Response latency has spiked — where is it?
│
├── Is it all endpoints or only specific ones?
│ │
│ ├── Specific endpoints only
│ │ │
│ │ ├── Are those endpoints backed by a specific downstream service?
│ │ │ └── Yes → check that downstream's latency independently
│ │ │ `kubectl exec -it <pod> -- curl -w "%{time_total}" http://<downstream>/healthz`
│ │ │ └── Downstream is slow → follow downstream branch below
│ │ │
│ │ └── Do those endpoints involve heavy queries or compute?
│ │ `kubectl logs <pod> | grep -E "slow|duration|took [0-9]+ms"`
│ │ └── Yes → ✅ ACTION: Optimize Query or Add Cache
│ │
│ └── All endpoints (service-wide latency increase)
│ │
│ ├── Is the spike correlated with a recent deployment?
│ │ `kubectl rollout history deployment/<name>`
│ │ │
│ │ ├── Yes, deployed in last 60 min
│ │ │ └── Did the new version introduce synchronous calls or remove caching?
│ │ │ └── Likely → ✅ ACTION: Roll Back Deployment
│ │ │
│ │ └── No recent deployment → continue below
│ │
│ ├── Check percentile breakdown: p50 vs p95 vs p99
│ │ (Prometheus: `histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))`)
│ │ │
│ │ ├── p50 fine, p99 very high (long tail) → outlier requests / GC pauses / lock contention
│ │ │ │
│ │ │ ├── JVM / Go service? → check GC metrics
│ │ │ │ `kubectl exec -it <pod> -- curl localhost:8080/metrics | grep gc_`
│ │ │ │ └── Long GC pauses → ✅ ACTION: Tune GC / Increase Memory
│ │ │ │
│ │ │ └── Check for DB lock waits
│ │ │ `SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL;`
│ │ │ └── Locks present → ✅ ACTION: Investigate DB Lock Contention
│ │ │
│ │ └── All percentiles high (uniform latency increase)
│ │ │
│ │ ├── Check app CPU/memory
│ │ │ `kubectl top pods -l app=<svc>`
│ │ │ │
│ │ │ ├── CPU > 80% of limit → CPU throttling causing latency
│ │ │ │ └── ✅ ACTION: Scale Pods or Increase CPU Limit
│ │ │ │
│ │ │ └── CPU/memory fine → bottleneck is downstream
│ │ │ → continue to downstream checks
│ │ │
│ │ ├── Check database latency
│ │ │ `SELECT pid, now()-query_start AS duration, query FROM pg_stat_activity
│ │ │ WHERE state='active' ORDER BY duration DESC LIMIT 10;`
│ │ │ │
│ │ │ ├── Long-running queries → ✅ ACTION: Identify and Kill Slow Query / Add Index
│ │ │ │
│ │ │ └── DB connection pool exhausted?
│ │ │ App logs: look for "pool timeout" / "connection refused"
│ │ │ └── Yes → ✅ ACTION: Tune Connection Pool / Scale DB
│ │ │
│ │ ├── Check cache hit rate
│ │ │ `redis-cli -h $REDIS_HOST info stats | grep keyspace_hits`
│ │ │ `redis-cli -h $REDIS_HOST info stats | grep keyspace_misses`
│ │ │ │
│ │ │ ├── Hit rate dropped sharply → cache eviction or restart cleared cache
│ │ │ │ └── ✅ ACTION: Investigate Cache Eviction / Warm Cache
│ │ │ │
│ │ │ └── Hit rate fine → not a cache issue
│ │ │
│ │ └── Check network latency
│ │ │
│ │ ├── DNS resolution time?
│ │ │ `kubectl exec -it <pod> -- time nslookup <dependency>`
│ │ │ └── DNS slow (>10ms) → ✅ ACTION: Fix DNS / Check CoreDNS
│ │ │
│ │ └── Packet loss or high RTT?
│ │ `kubectl exec -it <pod> -- ping -c 20 <dependency-ip>`
│ │ └── Loss > 1% or RTT > 5ms baseline → ✅ ACTION: Investigate Network Path
│ │ → check cloud provider console for VPC issues
Node Details¶
Check 1: Endpoint-level vs service-wide¶
Command: Look at your APM, or: `kubectl exec -it <pod> -- curl -w "time_connect: %{time_connect}\ntime_starttransfer: %{time_starttransfer}\ntime_total: %{time_total}\n" -o /dev/null -s http://<endpoint>`
What you're looking for: Whether slow endpoints share a common dependency (same DB table, same microservice call, same external API).
Common pitfall: A single slow endpoint with a missing index can cause connection pool exhaustion that then makes ALL endpoints slow. Treat "all slow" as a symptom, not a root cause.
Check 2: Deployment correlation¶
Command: `kubectl rollout history deployment/<name> --revision=<N>` to see what changed. Also: `git log --since="2 hours ago" --oneline` in the service repo.
What you're looking for: Timestamp of the last rollout compared to the onset time in your latency graph. Also check for ConfigMap or Secret changes: `kubectl describe deployment | grep -A3 "Last Applied"`.
Common pitfall: A Helm values change that added a synchronous external call (e.g., audit logging on every request) will not show in rollout history — look at values diff.
Check 3: p50/p95/p99 percentile breakdown¶
Command: In Prometheus/Grafana: `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="<svc>"}[5m])) by (le))`. Compare with p50 on the same panel.
What you're looking for: A rising p99 with a flat p50 = "long tail" problem (GC, lock contention, outlier slow requests). All percentiles rising = systemic slowdown (CPU, DB, network).
Common pitfall: `histogram_quantile` requires well-configured bucket boundaries. If all requests land in the last bucket, the quantile is inaccurate — check for bucket overflow.
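The bucket-overflow pitfall can be sanity-checked offline. Given the cumulative `le`/count pairs a histogram exports, the best resolution you can ever get is the first bucket boundary whose cumulative count reaches the p99 rank. A minimal sketch (bucket values are hypothetical; assumes only `awk`):

```shell
# Report the bucket boundary at or above which the p99 observation falls.
# Input on stdin: one "le cumulative_count" pair per line, in ascending
# bucket order, with the +Inf bucket last (Prometheus histogram convention).
p99_bucket() {
  awk '{ le[NR] = $1; c[NR] = $2 }
       END {
         target = 0.99 * c[NR]            # rank of the p99 observation
         for (i = 1; i <= NR; i++)
           if (c[i] >= target) { print le[i]; exit }
       }'
}
```

If nearly all observations sit in the final buckets, this prints the last finite boundary (or `+Inf`), which is exactly why poorly chosen buckets make `histogram_quantile` misleading.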
Check 4: CPU throttling¶
Command: `kubectl top pods -l app=<svc> -n <namespace>` and in Prometheus: `rate(container_cpu_cfs_throttled_seconds_total{pod=~"<pod-prefix>.*"}[5m])`. Also: `kubectl describe pod <pod> | grep -A4 "Limits"`.
What you're looking for: `container_cpu_cfs_throttled_seconds_total` rising is the definitive signal for CPU throttling, even when `kubectl top` shows usage below the limit.
Common pitfall: `kubectl top` shows CPU usage, not CPU throttling. A pod using 200m with a 250m limit may be throttled 30% of the time at burst moments, causing latency spikes invisible to `top`.
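Because `kubectl top` hides throttling, the CFS counters can also be read straight from the container's cgroup. A hedged sketch (cgroup v1 path shown in the comment; cgroup v2 exposes the same `nr_periods`/`nr_throttled` counters in `/sys/fs/cgroup/cpu.stat`):

```shell
# Percentage of CFS scheduling periods in which the container was throttled.
# Capture the file first, e.g.:
#   kubectl exec <pod> -- cat /sys/fs/cgroup/cpu/cpu.stat > cpu.stat
throttled_pct() {
  awk '/^nr_periods/   { p = $2 }
       /^nr_throttled/ { t = $2 }
       END { if (p > 0) printf "%.1f\n", 100 * t / p }' "$1"
}
```

Anything consistently above a few percent during the latency spike points at the CPU limit, not at load.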
Check 5: Database slow queries¶
Command: `kubectl exec -it <db-pod> -- psql -U postgres -c "SELECT pid, now()-query_start AS duration, left(query,80) FROM pg_stat_activity WHERE state='active' AND query_start < now()-interval '5 seconds' ORDER BY duration DESC;"`
What you're looking for: Queries running longer than expected. Any query taking >100ms on an OLTP workload is worth examining.
Common pitfall: `pg_stat_activity` shows only currently-running queries. To catch queries that complete quickly but run frequently, enable `pg_stat_statements` and check `total_exec_time / calls`.
Check 6: Cache hit rate¶
Command: `redis-cli -h $REDIS_HOST -a $REDIS_PASSWORD info stats | grep -E "keyspace_hits|keyspace_misses|evicted_keys"`
What you're looking for: `evicted_keys` increasing rapidly means Redis is evicting data under memory pressure, causing more cache misses and DB load. Hit rate = hits / (hits + misses).
Common pitfall: A Redis restart or `FLUSHALL` (e.g., after a credential rotation) will zero out the cache. The first 5-10 minutes after restart will show a near-100% miss rate — this is expected and should self-correct.
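The hit-rate arithmetic above is easy to fumble mid-incident; a small helper that parses `redis-cli info stats` output keeps it mechanical. A sketch, assuming the standard `keyspace_hits`/`keyspace_misses` field names:

```shell
# Compute cache hit rate from "redis-cli info stats" output.
# Usage: redis-cli -h $REDIS_HOST info stats | hit_rate
# (trailing \r that redis-cli emits is ignored by awk's numeric coercion)
hit_rate() {
  awk -F: '/^keyspace_hits/   { h = $2 }
           /^keyspace_misses/ { m = $2 }
           END { if (h + m > 0) printf "%.1f%%\n", 100 * h / (h + m) }'
}
```

Sample the rate twice a few minutes apart: a healthy cache usually sits well above 90%, and a sudden drop lines up with eviction or a restart.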
Check 7: DNS latency¶
Command: `kubectl exec -it <pod> -- bash -c 'for i in $(seq 1 10); do time nslookup <dependency-service> 2>&1 | tail -3; done'`
What you're looking for: Any lookup taking >5ms is suspicious. Consistent >10ms is a problem.
Common pitfall: DNS issues often manifest as intermittent latency rather than consistent failure. A single slow lookup can make p99 spike while p50 looks fine. Check CoreDNS pod health: `kubectl get pods -n kube-system -l k8s-app=kube-dns`.
Terminal Actions¶
Action: Roll Back Deployment¶
Do:
1. `kubectl rollout undo deployment/<name>`
2. `kubectl rollout status deployment/<name>` — wait for completion
3. Verify latency drops in metrics within 2-3 minutes
Verify: p99 latency returns to baseline; `kubectl get pods -l app=<name>` shows all pods Ready.
Runbook: helm_upgrade_failed.md
Action: Scale Pods or Increase CPU Limit¶
Do:
1. Check if HPA exists: `kubectl get hpa`
2. Manual scale: `kubectl scale deployment <name> --replicas=<N+2>`
3. If throttling is the root cause (not just load), increase the CPU limit: `kubectl set resources deployment <name> --limits=cpu=1000m`
4. Long-term: raise or remove the CPU limit so bursts are not throttled, and keep the CPU request set so the scheduler still reserves capacity
Verify: `kubectl top pods` shows CPU below 70%. Latency p99 drops.
Runbook: hpa_not_scaling.md
Action: Identify and Kill Slow Query / Add Index¶
Do:
1. Identify the slow query from `pg_stat_activity` or `pg_stat_statements`
2. Kill it if it is blocking others: `SELECT pg_terminate_backend(<pid>);`
3. Run `EXPLAIN (ANALYZE, BUFFERS) <slow query>;` to identify a missing index
4. Add the index in an off-peak window: `CREATE INDEX CONCURRENTLY idx_name ON table(col);`
Verify: `pg_stat_activity` no longer shows long-running queries. App latency drops.
Action: Tune Connection Pool / Scale DB¶
Do:
1. Check the current pool config in the app: env vars like `DB_POOL_SIZE`, `DB_MAX_CONNECTIONS`
2. Check Postgres max_connections: `SHOW max_connections;`
3. Reduce pool size if oversubscribed (pool_size * pod_count should not exceed max_connections * 0.8)
4. Consider PgBouncer if not already deployed
Verify: No more "pool timeout" errors in app logs. Query queue depth drops.
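The sizing rule in step 3 can be checked before touching anything. A sketch (the 80% headroom factor is the rule of thumb above, leaving slots for admin and reserved connections):

```shell
# Sanity-check total app connections against 80% of Postgres max_connections.
# Usage: check_pool <pool_size_per_pod> <pod_count> <max_connections>
check_pool() {
  demand=$(( $1 * $2 ))
  budget=$(( $3 * 80 / 100 ))
  if [ "$demand" -gt "$budget" ]; then
    echo "OVERSUBSCRIBED: $demand connections vs budget $budget"
  else
    echo "OK: $demand connections within budget $budget"
  fi
}
```

For example, a pool of 20 across 10 pods against `max_connections=100` is oversubscribed 2.5x before any admin connections are counted.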
Action: Investigate Cache Eviction / Warm Cache¶
Do:
1. Check Redis memory: `redis-cli info memory | grep used_memory_human`
2. Check the eviction policy: `redis-cli config get maxmemory-policy`
3. If evicting: increase the Redis memory limit or switch to the `allkeys-lru` policy
4. Pre-warm the cache by replaying recent reads against the DB
Verify: `keyspace_misses` rate drops. App latency returns to baseline.
Action: Fix DNS / Check CoreDNS¶
Do:
1. Check CoreDNS pods: `kubectl get pods -n kube-system -l k8s-app=kube-dns`
2. Check CoreDNS logs: `kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100`
3. Check the CoreDNS ConfigMap: `kubectl get cm -n kube-system coredns -o yaml`
4. If CoreDNS is overloaded, scale it: `kubectl scale deployment coredns -n kube-system --replicas=3`
5. Set `ndots:2` in the pod spec to reduce unnecessary DNS lookup attempts
Verify: `nslookup` from within pods returns in <5ms.
Runbook: dns_resolution.md
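Step 5's `ndots` change lives in the pod template's `dnsConfig`. A minimal fragment (field names are the standard Kubernetes pod spec; where it sits in your chart is up to you):

```yaml
# Deployment pod-template excerpt: with ndots:2, names containing two or
# more dots are tried as absolute first, skipping the search-domain
# expansion that multiplies queries under the default ndots:5.
spec:
  template:
    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "2"
```

Note this changes resolution behavior for short service names too, so verify in-cluster lookups (`<svc>.<namespace>`) still resolve after the change.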
Action: Tune GC / Increase Memory¶
Do:
1. For JVM: set `JAVA_OPTS="-XX:+UseG1GC -Xms512m -Xmx2g -XX:MaxGCPauseMillis=200"`
2. For Go: inspect `runtime.ReadMemStats` exposed via pprof
3. Increase the pod memory limit if the heap is constrained: `kubectl set resources deployment <name> --limits=memory=2Gi`
4. Confirm the memory limit increase doesn't trigger OOM: watch `kubectl get pods` for `OOMKilled` status
Verify: GC pause metrics drop. p99 long-tail latency reduces.
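To judge whether pauses are actually long, mean pause time can be derived from the exporter's `_sum`/`_count` pair. A sketch, assuming Prometheus-style pause metrics on stdin (exact metric names vary by runtime and exporter, e.g. `jvm_gc_pause_seconds_*` for the JVM or `go_gc_duration_seconds` for Go):

```shell
# Mean GC pause in milliseconds from a metric's _sum/_count pair.
# Usage: curl -s localhost:8080/metrics | grep <pause-metric> | mean_pause_ms
mean_pause_ms() {
  awk '/_sum/   { s = $NF }
       /_count/ { c = $NF }
       END { if (c > 0) printf "%.1f\n", 1000 * s / c }'
}
```

Bear in mind this is a lifetime average; for the incident window, compare two samples taken a few minutes apart, or use `rate()` on the same metrics in Prometheus.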
Action: Optimize Query or Add Cache¶
Do:
1. Run `EXPLAIN ANALYZE` on the endpoint's query to identify the slow plan
2. Add an index if there is a seq scan on a large table: `CREATE INDEX CONCURRENTLY ...`
3. Add an application-layer cache for hot, repeated reads
4. If the query is unavoidable and slow, move it to async processing where possible
Verify: Endpoint-specific latency drops. DB CPU decreases.
Action: Investigate Network Path¶
Do:
1. Check MTU mismatches: `kubectl exec -it <pod> -- ping -M do -s 1450 <ip>`
2. Check for packet loss: `kubectl exec -it <pod> -- traceroute -n <ip>`
3. Check cloud provider VPC flow logs for drops
4. Check if a network policy is causing TCP resets: `kubectl get networkpolicy -n <namespace>`
Verify: Round-trip latency returns to <1ms within the cluster.
Escalation: Network / Infrastructure Investigation¶
When: Latency is high across multiple services, DNS and DB are healthy, no recent deployments.
Who: Infrastructure / Platform team, Cloud provider support
Include in page: Affected services, onset time, traceroute outputs, whether latency correlates with node identity (some nodes affected, others not)
Edge Cases¶
- Latency spike only during business hours: May indicate legitimate load growth (horizontal scaling needed) rather than a bug or misconfiguration. Check traffic volume alongside latency.
- Latency spike on leader pod only: StatefulSet with leader election — if the leader pod is resource-constrained, all writes slow down. Check `kubectl get pods` and identify which pod holds the lease.
- Latency drops when a specific pod is deleted: That pod may have a corrupted connection pool or local state. Delete it after confirming others can handle load.
- Service mesh (Istio/Linkerd): Sidecar proxy overhead adds ~0.5-1ms per hop. After mesh installation, small latency increases are expected. Check `istioctl proxy-status` for stale config causing extra retries.
- Latency spike during node drain: Pod rescheduling causes connection churn. Ensure `terminationGracePeriodSeconds` is sufficient for in-flight requests to complete.
Cross-References¶
- Topic Packs: k8s-ops, observability-deep-dive, linux-performance, postgresql, redis
- Runbooks: hpa_not_scaling.md, dns_resolution.md, helm_upgrade_failed.md, oomkilled.md