The Phantom Latency Spike¶
Category: The Mystery | Domains: linux-performance, networking | Read time: ~5 min
Setting the Scene¶
Our API gateway served around 12,000 requests per second for a fintech platform. Performance was solid -- p99 latency under 40ms -- except for one maddening anomaly. Every single day, at exactly 2:47 PM UTC, latency would spike to 800ms+ for exactly 90 seconds, then return to normal as if nothing had happened.
I stared at this pattern in Grafana for a week before deciding it was real and not just a coincidence.
What Happened¶
My first theory was garbage collection. We were running Java services on the gateway tier, and a 90-second GC pause would explain everything. I spent two days adding GC logging flags (-Xlog:gc*:file=/var/log/gc.log:time,uptime,level), analyzing heap dumps, tuning G1GC ergonomics. The GC logs showed nothing unusual at 2:47 PM.
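In hindsight, ruling GC in or out is a one-liner against the unified log. A minimal sketch, assuming Java 11+ -Xlog:gc* output where each pause line contains "Pause" and ends in a duration like "37.214ms" (the 200ms threshold is arbitrary):

```shell
# Flag GC pauses longer than 200ms in a unified JVM GC log.
# Assumes pause lines end in a duration field such as "37.214ms".
awk '/Pause/ {
  ms = $NF + 0                # "37.214ms" coerces to the number 37.214
  if (ms > 200) print
}' /var/log/gc.log
```

Run against the 2:47 window, a check like this comes back empty, which is exactly what the full log analysis showed.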
Next, I suspected a traffic pattern. Maybe a batch client was hammering us at that exact time. I wrote a custom script to aggregate access logs by source IP in 10-second windows. Traffic distribution was perfectly flat through the spike. No single client was misbehaving.
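The aggregation itself is a few lines of awk. A sketch, under the assumption that the access log has first been reduced to "epoch_seconds client_ip" pairs (real combined-log timestamps need a parsing step first):

```shell
# Count requests per source IP in 10-second buckets.
# Input format (assumed): "epoch_seconds client_ip", one request per line.
awk '{
  bucket = int($1 / 10) * 10        # floor the timestamp to its 10s window
  count[bucket " " $2]++
}
END { for (k in count) print k, count[k] }' access.log | sort -k1,1n -k3,3nr
```

Sorting by bucket, then by descending count, makes a single hot client jump out immediately; here, nothing did.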
Third theory: database connection pool exhaustion. The gateway hit a PostgreSQL cluster for auth token validation. I added connection pool metrics, watched pg_stat_activity during the spike window. Nothing. Database was at 15% connection utilization.
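That 15% figure is just active sessions divided by the configured ceiling. A sketch of the check, assuming psql access to the cluster (the two queries are standard; the variable names are mine):

```shell
# Compare active sessions against the configured connection ceiling.
ACTIVE=$(psql -At -c "SELECT count(*) FROM pg_stat_activity")   # e.g. 30
MAX=$(psql -At -c "SHOW max_connections")                       # e.g. 200
awk -v a="$ACTIVE" -v m="$MAX" \
  'BEGIN { printf "%.0f%% of connections in use\n", 100 * a / m }'
```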
I was starting to doubt my own dashboards. Then a colleague casually mentioned: "That gateway box shares a rack with the backup server, right?" I hadn't thought about the physical infrastructure in months. We were mostly a "cloud-native" shop, but these particular gateway nodes ran on bare metal in our colo because of latency requirements.
I SSH'd into the backup server and ran crontab -l. There it was: 47 14 * * * /opt/backup/full-snapshot.sh. Every day at 2:47 PM, a full rsync backup of 400GB of logs to an NFS target over the same 10Gbps uplink our gateway traffic traversed. The backup saturated the shared switch port for exactly 90 seconds.
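The find was pure luck; it could have been a five-minute sweep. A sketch that checks every local user's crontab for entries firing at a literal minute/hour, assuming root and standard 5-field crontab lines (it does not expand ranges, steps, or wildcards):

```shell
# List cron entries scheduled at exactly 14:47 across all local users.
for u in $(cut -d: -f1 /etc/passwd); do
  crontab -l -u "$u" 2>/dev/null | \
    awk -v min=47 -v hr=14 -v user="$u" \
      '$1 == min && $2 == hr { print user ": " $0 }'
done
```

Run on every box in the rack, this would have surfaced the backup job on day one.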
The Moment of Truth¶
I ran iftop on the backup server during the 2:47 window and watched it push 9.7 Gbps through the shared uplink. Then I checked sar -n DEV 1 on the gateway box at the same time -- TX queue drops, packet retransmits, the whole picture snapped into focus. Our gateway was competing for bandwidth with a firehose backup job.
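The saturation math is simple once you're looking at the right counter. A sketch that watches sar output for a link pushed past 80% of 10Gbps, assuming txkB/s is field 6 on each interface line (column positions vary across sysstat versions):

```shell
# Flag samples where the NIC exceeds 8 Gbit/s (80% of a 10Gbps uplink).
sar -n DEV 1 | awk -v iface=eth0 '$2 == iface {
  gbps = $6 * 8 / 1000 / 1000       # txkB/s -> Gbit/s (decimal units)
  if (gbps > 8) printf "%s saturating: %.1f Gbit/s\n", iface, gbps
}'
```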
The Aftermath¶
We moved the backup job to 3:00 AM and put it on a dedicated NIC connected to a separate switch. The phantom spike vanished immediately. Total fix time: 20 minutes of actual work, after two weeks of investigation. I added network utilization panels to every host dashboard that same afternoon.
The Lessons¶
- Correlate with time patterns: When something happens at the exact same time every day, crontab -l on every box in the blast radius should be your first move.
- Shared resources cause shared pain: A backup server on a shared network segment can silently degrade unrelated services. Physical topology matters even in "software-defined" environments.
- Isolate workloads by network path: Bulk data transfer and latency-sensitive traffic should never share a last-mile link without QoS or physical separation.
What I'd Do Differently¶
Maintain a registry of all cron jobs across the fleet, searchable by time window. Add network saturation alerts (not just errors) on every host NIC. And never, ever assume the problem is in your application when the symptom is a perfect time-based pattern.
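The saturation alert is cheap to build from kernel counters alone. A sketch that samples TX bytes from sysfs ten seconds apart and alerts above 8 Gbit/s; the interface name, interval, and threshold are all assumptions to adapt per host:

```shell
IFACE=eth0
# Cumulative TX byte counter exposed by the kernel for each interface.
t1=$(cat /sys/class/net/$IFACE/statistics/tx_bytes)
sleep 10
t2=$(cat /sys/class/net/$IFACE/statistics/tx_bytes)
awk -v a="$t1" -v b="$t2" -v n="$IFACE" 'BEGIN {
  gbps = (b - a) * 8 / 10 / 1e9     # byte delta over 10s -> Gbit/s
  if (gbps > 8) printf "ALERT: %s at %.1f Gbit/s\n", n, gbps
}'
```

Wired into cron or a metrics agent, this fires on saturation itself, not just on the errors and drops that follow it.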
The Quote¶
"Two weeks chasing a ghost in the application layer, and the answer was a cron job on a box I'd forgotten existed."