
Interview Gauntlet: Network Latency Spikes Every 30 Seconds

Category: Debugging | Difficulty: L2-L3 | Duration: 15-20 minutes | Domains: Networking, Linux Kernel


Round 1: The Opening

Interviewer: "Your service is experiencing network latency spikes every 30 seconds. Average latency is 5ms, but every 30 seconds it jumps to 500ms for about 2 seconds. Where do you start?"

Strong Answer:

"A periodic, consistent spike pattern like every-30-seconds usually points to a scheduled process rather than random traffic. I'd start by correlating the timing with known scheduled activities: CronJobs (though 30 seconds is unusual for cron), garbage collection cycles (if it's a JVM app), health check sweeps, or Kubernetes probes. First, I'd check if the latency spike is visible at the application level or the network level: curl the service repeatedly with timestamps to see if the response time spikes, and simultaneously run ping to the pod IP to see if raw network latency changes. If ping stays at 1ms but HTTP response time spikes, it's application-level. If ping spikes too, it's network or kernel level. For network investigation, I'd look at ss -s on the host to see connection statistics, and conntrack -S to check for conntrack table issues — dropped entries or table full events. The 30-second periodicity is a strong clue: conntrack entry timeout for established TCP connections defaults to 432000 seconds, but for UDP it's 30 seconds. If there's UDP traffic involved, that's suspicious."
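The "correlate the timing" step in this answer can be sketched as a small script: given timestamped latency samples, flag the spikes and measure the spacing between them. The threshold and the synthetic trace below are illustrative assumptions — in practice the samples would come from a curl/ping loop that logs one (timestamp, latency) pair per second.

```python
# Sketch: detect whether latency spikes recur at a fixed period.
# The samples below are synthetic; in practice they would come from
# a curl/ping loop logging (timestamp_sec, latency_ms) pairs.

def spike_period(samples, threshold_ms=100.0):
    """Return the gaps (in seconds) between consecutive latency spikes."""
    spike_times = [t for t, latency in samples if latency >= threshold_ms]
    return [b - a for a, b in zip(spike_times, spike_times[1:])]

# Synthetic trace: 5 ms baseline, 500 ms spikes at t=30, 60, 90.
samples = [(t, 500.0 if t % 30 == 0 else 5.0) for t in range(1, 100)]

gaps = spike_period(samples)
print(gaps)  # -> [30, 30]: a constant gap points at a scheduled process
```

A constant gap (here 30 seconds) supports the "scheduled process, not random traffic" hypothesis; irregular gaps would point back toward load-dependent causes.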

Common Weak Answers:

  • "Check the application for bugs." — Too vague. The periodic nature strongly suggests infrastructure, not application logic.
  • "It's probably garbage collection." — Possible for a JVM app, but the candidate should verify rather than guess. GC pauses don't typically cause network latency visible to ping.
  • "Add more instances to handle the load." — Scaling doesn't fix periodic spikes. If it's a network issue, more pods on the same node will all experience the same problem.

Round 2: The Probe

Interviewer: "You confirm it's network-level — ping latency to the pod spikes too. It's not application-level. You check the kernel logs and see 'nf_conntrack: table full, dropping packet' messages in dmesg. What exactly is happening and how do you fix it?"

What the interviewer is testing: Understanding of Linux conntrack, its role in Kubernetes networking, and how to diagnose and resolve table exhaustion.

Strong Answer:

"The conntrack table is a kernel data structure that tracks network connections for stateful firewall rules (iptables/nftables). In Kubernetes, every service uses iptables or IPVS rules for load balancing, and every connection through a Service IP creates a conntrack entry. When the table is full, new connections get dropped, causing latency spikes as TCP retransmits or UDP packets are lost. I'd check the current table size and limit: sysctl net.netfilter.nf_conntrack_count shows current entries, sysctl net.netfilter.nf_conntrack_max shows the limit. If count is at or near max, the table is full. The fix has two parts. First, immediate relief: increase the limit with sysctl -w net.netfilter.nf_conntrack_max=262144 (or higher, depending on available memory — each entry uses about 300 bytes, so 262144 entries is about 75 MB), and persist it in /etc/sysctl.d/99-conntrack.conf. Second, reduce the number of entries by tuning timeouts: net.netfilter.nf_conntrack_tcp_timeout_time_wait=60 (default is 120), net.netfilter.nf_conntrack_tcp_timeout_established=3600 (default is 432000 — 5 days). Reducing the established timeout to 1 hour is aggressive but reasonable if most connections are short-lived. For the 30-second spike pattern specifically: UDP conntrack entries expire after 30 seconds by default (net.netfilter.nf_conntrack_udp_timeout=30), and when they expire in bulk, the cleanup might cause contention."
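The sizing claims in this answer are easy to sanity-check with back-of-envelope arithmetic. The per-entry size and connection rate below are the rough estimates from the answer, not exact kernel figures.

```python
# Back-of-envelope helpers for conntrack sizing, using the ~300-byte
# per-entry estimate from the answer (actual size varies by kernel).

ENTRY_BYTES = 300  # rough estimate; real value is ~288-320 bytes

def table_memory_mb(max_entries, entry_bytes=ENTRY_BYTES):
    """Approximate kernel memory consumed by a full conntrack table."""
    return max_entries * entry_bytes / (1024 * 1024)

def steady_state_entries(new_conns_per_sec, entry_lifetime_sec):
    """Little's law: entries present = arrival rate * entry lifetime."""
    return new_conns_per_sec * entry_lifetime_sec

print(round(table_memory_mb(262144)))  # -> 75 (MB) for the suggested limit
print(steady_state_entries(2000, 120)) # -> 240000: 2k short conns/s with a
                                       #    120 s TIME_WAIT lifetime
```

The second calculation shows why timeout tuning matters: with heavy connection churn, even a 262144-entry table can be dominated by entries lingering in TIME_WAIT.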

Trap Alert:

If the candidate bluffs here: The interviewer will ask "How much memory does each conntrack entry use?" It's approximately 288-320 bytes depending on the kernel version and whether IPv4 or IPv6. At 1 million entries, that's about 300 MB. Candidates who can't estimate this haven't worked with conntrack at scale. It's fine to say "I recall it's a few hundred bytes per entry, so a million entries is a few hundred megabytes."


Round 3: The Constraint

Interviewer: "Increasing the conntrack table helps, but you're still seeing occasional spikes. The root cause is that this service makes thousands of short-lived connections per second to a Redis cluster — each connection creates a conntrack entry, lives for 50ms, then enters TIME_WAIT. The conntrack table fills up again within hours. How do you fix this properly?"

Strong Answer:

"The root fix is to eliminate the short-lived connections. Thousands of new connections per second to Redis means the application is opening a fresh TCP connection for every Redis command instead of using a connection pool. I'd look at the application's Redis client configuration — most Redis client libraries (Jedis, redis-py, ioredis) support connection pooling. Setting a pool size of 20-50 connections would reduce the connection creation rate from thousands per second to essentially zero during steady state. The pool maintains persistent connections that are reused across requests. This would also improve application performance — TCP handshake overhead for each command adds 0.5-1ms per operation. With a pool, the latency per Redis command drops to the network RTT plus command execution time. If the application can't be changed immediately, a connection-level proxy like Envoy or Twemproxy can pool connections on behalf of the application. Put the proxy as a sidecar in the pod, the app connects to localhost:6379 (the proxy), and the proxy maintains a fixed pool of connections to the actual Redis cluster. Beyond connection pooling, I'd also check whether the application can pipeline Redis commands or use Redis multi/exec to batch operations, further reducing round-trips."
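The pooling argument can be illustrated with a minimal, generic pool — stdlib only, not a real Redis client, and the fake connect function is a stand-in for a TCP handshake: connections are created once up front, then reused, so the connection-creation rate drops to essentially zero in steady state.

```python
import queue

class ConnectionPool:
    """Toy fixed-size pool: create connections once, then reuse them."""
    def __init__(self, size, connect):
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(connect())   # pay the handshake cost up front

    def run(self, command):
        conn = self._free.get()         # borrow an existing connection
        try:
            return command(conn)
        finally:
            self._free.put(conn)        # return it for reuse

# Count how many "connections" are ever created.
created = []
def fake_connect():
    created.append(object())            # stands in for a real TCP connect
    return created[-1]

pool = ConnectionPool(size=20, connect=fake_connect)
for _ in range(10_000):                 # ten thousand "commands"
    pool.run(lambda conn: conn)

print(len(created))  # -> 20: 10,000 commands over just 20 connections
```

Real Redis clients (Jedis, redis-py, ioredis) implement this same idea internally; the point of the sketch is only that pool size, not request rate, determines how many connections (and conntrack entries) exist.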

The Senior Signal:

What separates a senior answer: Diagnosing the conntrack problem as a symptom of the real issue: missing connection pooling. Many candidates will keep tuning conntrack settings (bigger tables, shorter timeouts) without asking "why are there so many connections in the first place?" The senior answer goes up the stack from the kernel symptom to the application architecture fix. Also: knowing that sidecar proxy pooling is a viable interim fix when the application code can't change immediately.


Round 4: The Curveball

Interviewer: "The developer says: 'I am using connection pooling — the pool is set to 100 connections. But I have 50 pods, so that's 5,000 persistent connections to Redis. Our Redis Cluster has 6 nodes and each is hitting its maxclients limit of 10,000.' Is this a valid concern?"

Strong Answer:

"Yes, it's a real concern. 50 pods * 100 connections per pod = 5,000 connections, and because cluster-aware Redis clients typically maintain a pool per node, each of the 6 nodes can see close to that worst case. Add other services connecting to the same Redis cluster and you can easily approach the 10,000 maxclients limit. But the fix isn't to reduce the pool size to 2 per pod — that would cause connection contention under load. The options are: first, right-size the pool. 100 connections per pod is likely excessive unless the app has 100 concurrent request handlers all needing Redis simultaneously. A pool of 10-20 per pod is usually enough for most workloads. That drops the total to 500-1,000 connections. Second, use a connection proxy at the pod level — an Envoy sidecar or Redis-aware proxy that multiplexes multiple application-level requests over a smaller number of actual TCP connections. Third, increase maxclients on Redis — the default is 10,000 but it can be raised. The limit is governed by the file descriptor limit: ulimit -n must be higher than maxclients. Each connection uses about 10-15 KB of memory in Redis, so 10,000 connections is about 150 MB of overhead, which is manageable on most production Redis instances. I'd probably combine approaches: right-size the pool to 20 and raise maxclients to 20,000, giving plenty of headroom."
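The headroom math in this answer can be made explicit. The pod counts, pool sizes, and per-connection overhead below are the figures quoted in the answer — estimates for illustration, not measured values.

```python
# Connection-budget arithmetic from the answer above (illustrative figures).

def worst_case_conns_per_node(pods, pool_size_per_pod):
    """Cluster-aware clients keep a pool per node, so a single node can
    see up to pods * pool_size connections in the worst case."""
    return pods * pool_size_per_pod

def redis_conn_memory_mb(connections, kb_per_conn=15):
    """Per-connection overhead in Redis, using the ~10-15 KB estimate."""
    return connections * kb_per_conn / 1024

print(worst_case_conns_per_node(50, 100))   # -> 5000, near maxclients 10000
print(worst_case_conns_per_node(50, 20))    # -> 1000 after right-sizing
print(round(redis_conn_memory_mb(10_000)))  # -> 146 (MB), i.e. "about 150 MB"
```

Right-sizing the pool to 20 and raising maxclients to 20,000, as the answer suggests, turns a 50% utilization of the connection limit into 5%.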

Trap Question Variant:

The right answer requires knowing Redis internals. Candidates who haven't managed Redis at scale might not know about maxclients or connection memory overhead. It's fine to say "I know Redis has a client connection limit but I don't recall the exact default or the per-connection memory overhead. I'd check the Redis configuration and documentation." The interviewer is looking for awareness of the connection scalability problem, not memorized defaults.


Round 5: The Synthesis

Interviewer: "We've gone from a 500ms latency spike to conntrack table exhaustion to connection pooling to Redis client limits. What's the broader lesson about debugging network issues in Kubernetes?"

Strong Answer:

"The lesson is that Kubernetes networking adds layers that are invisible until they break. In a traditional VM deployment, a connection from app to Redis is just a TCP connection. In Kubernetes, that same connection traverses: the application, the pod's network namespace, the veth pair to the host, iptables/IPVS rules for service routing, conntrack for connection tracking, potentially a CNI overlay network (VXLAN encapsulation), and then the same stack in reverse at the destination. Each layer can be the bottleneck, and the symptoms often appear at a different layer than the root cause — we saw latency spikes (application symptom) caused by conntrack table exhaustion (kernel layer) caused by connection churn (application layer). My debugging approach for Kubernetes networking is to work from the outside in: start with the user-visible symptom (latency), then check the network layer (ping, traceroute, packet loss), then the kernel layer (conntrack, iptables counters, netstat), then the application layer (connection patterns, pool configs). And always ask: 'What would this look like without Kubernetes?' If the same app on a VM wouldn't have this issue, the problem is in the Kubernetes networking stack. If it would, the problem is in the application."

What This Sequence Tested:

  • Round 1: Systematic network latency debugging methodology
  • Round 2: Linux conntrack mechanics and kernel-level networking
  • Round 3: Root cause analysis — moving from symptom to architectural fix
  • Round 4: Distributed system scaling and Redis operational knowledge
  • Round 5: Kubernetes networking mental model and layer-by-layer debugging
