
The SSL Handshake Timeout

Category: The Mystery · Domains: tls, networking · Read time: ~5 min


Setting the Scene

We ran a B2B SaaS platform serving about 300 enterprise customers. Each customer connected from their own corporate network -- different ISPs, different firewalls, different everything. One Thursday, our largest customer called: their entire engineering team couldn't reach our API. HTTPS connections timed out after 30 seconds. HTTP (port 80, which we kept for redirects) worked fine. Every other customer was unaffected.

I was the infrastructure lead, and my first thought was "their firewall is blocking port 443." I was wrong, but it took me five days to figure out why.

What Happened

The customer's network team confirmed port 443 was open. They could telnet to our IP on 443 and get a TCP connection. But the TLS handshake never completed. openssl s_client -connect api.example.com:443 from their office hung indefinitely after CONNECTED(00000003). From my laptop, the same command completed in 80ms.
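The vantage-point comparison we did by hand with openssl s_client can also be scripted. A minimal sketch using Python's standard ssl module (the helper name and the 5-second timeout are my own choices, not from the incident):

```python
import socket
import ssl
import time

def tls_handshake_ms(host: str, port: int = 443, timeout: float = 5.0) -> float:
    """Time a full TLS handshake to host:port, in milliseconds.

    Raises socket.timeout if the handshake hangs -- exactly the failure
    mode described above. (Hypothetical helper, not part of the
    original debugging session.)
    """
    ctx = ssl.create_default_context()
    start = time.monotonic()
    # create_connection performs the TCP three-way handshake;
    # wrap_socket then performs the TLS handshake on top of it.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host):
            pass
    return (time.monotonic() - start) * 1000.0
```

Run from both an affected and an unaffected network, the difference shows up immediately: one returns in tens of milliseconds, the other raises a timeout.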

Day one, I blamed their corporate proxy. Many enterprise networks intercept TLS for inspection. They swore they didn't. I asked them to bypass it anyway and try from a direct connection. Same result.

Day two, I focused on our side. Maybe our TLS configuration rejected their cipher suite. I checked our nginx config -- we supported TLS 1.2 and 1.3 with a broad cipher list. I spun up a test endpoint with ALL ciphers enabled. Still timed out from their network.

Day three, I tried to reproduce from other locations. I used curl --connect-to from five different cloud providers. All worked. I asked the customer to try from a coffee shop. It worked. Only their office network was affected, but only for TLS -- TCP connected fine.

Day four, I was desperate enough to drive to their office with a laptop. I plugged into their network and confirmed the issue. I ran tcpdump -i eth0 -s 0 port 443 -w /tmp/tls-debug.pcap on both my laptop (client side) and our server simultaneously.

On the client capture, I could see the TCP three-way handshake complete. Then the ClientHello went out -- but it was fragmented across two TCP segments. The first segment was 1,460 bytes, the second was 87 bytes. On the server capture, I saw the SYN/ACK and the first TCP segment of the ClientHello arrive... but the second segment never showed up.
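The split the captures showed follows directly from the sender's MSS. A toy model of TCP segmentation (the 1,547-byte total is simply the sum of the two observed segments):

```python
def tcp_segments(payload_len: int, mss: int) -> list[int]:
    """Split an application payload into TCP segment sizes,
    the way a sender with the given MSS would."""
    segments = []
    while payload_len > 0:
        segments.append(min(mss, payload_len))
        payload_len -= segments[-1]
    return segments

# A ClientHello just past one 1,460-byte segment leaves a small
# trailing segment -- the one that never reached the server.
print(tcp_segments(1547, 1460))  # -> [1460, 87]
```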

I checked the path MTU. Their office network had an MTU of 1,500 (standard), but they were on a site-to-site VPN to a data center that used an MPLS overlay with an effective MTU of 1,400. TCP should handle this via Path MTU Discovery -- but their edge firewall was blocking ICMP "Fragmentation Needed" messages. So the sending host never learned it needed to send smaller packets.
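The numbers line up once you subtract the headers. A sketch of the arithmetic, assuming IPv4 with no IP or TCP options:

```python
IP_HEADER = 20   # IPv4 header, no options
TCP_HEADER = 20  # TCP header, no options

def max_segment_size(path_mtu: int) -> int:
    """Largest TCP payload that fits in one unfragmented IP packet."""
    return path_mtu - IP_HEADER - TCP_HEADER

print(max_segment_size(1500))  # standard Ethernet: 1460, the first segment's size
print(max_segment_size(1400))  # the MPLS overlay: 1360
```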

The Moment of Truth

The TLS ClientHello was larger than usual because the customer's systems advertised a long list of TLS extensions and supported cipher suites, pushing the handshake past 1,460 bytes of TCP payload. It got fragmented at the VPN boundary, and a firewall rule dropped the second fragment (their security policy dropped all IP fragments as a "hardening measure"). The first fragment alone wasn't a complete TLS ClientHello, so our server waited forever for the rest.

A quick ping -M do -s 1400 api.example.com from their network confirmed it: with Don't Fragment set, anything over the 1,400-byte path MTU was black-holed (-s sets the ICMP payload, so -s 1400 puts 1,428 bytes on the wire once the ICMP and IP headers are added).
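The ping sizes are easy to misread because -s is payload, not packet size. The arithmetic, again assuming IPv4 with no options:

```python
IP_HEADER = 20   # IPv4 header, no options
ICMP_HEADER = 8  # ICMP echo header

def ping_packet_size(payload: int) -> int:
    """On-wire IPv4 packet size for `ping -s payload`."""
    return payload + ICMP_HEADER + IP_HEADER

print(ping_packet_size(1400))  # 1428 bytes: over the 1,400-byte path MTU
print(ping_packet_size(1372))  # 1400 bytes: the largest payload that fits
```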

The Aftermath

The customer's network team added an ICMP exception for type 3, code 4 (Fragmentation Needed) and adjusted their VPN tunnel MTU to 1,400 with TCP MSS clamping (iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1360). We also trimmed our TLS certificate chain by removing an unnecessary intermediate certificate, which shrank the server's handshake flight enough to avoid fragmentation in the other direction.

The Lessons

  1. MTU mismatches cause bizarre symptoms: When PMTUD is broken (usually by overzealous ICMP blocking), connections fail in ways that look like application or TLS bugs. Always check path MTU when TLS works from some networks but not others.
  2. Test from multiple network paths: "Works from AWS" doesn't mean it works from a corporate network behind a VPN with fragment-dropping firewall rules. Test from the actual client network.
  3. tcpdump before guessing: Two hours of simultaneous packet captures told me more than four days of theory. When TLS handshakes fail silently, packet captures on both ends are the fastest path to root cause.

What I'd Do Differently

Add a network diagnostic page at a known URL that tests various payload sizes and reports MTU issues to the client. Monitor TLS handshake completion rates by source ASN to catch network-specific failures early. And never spend four days guessing when tcpdump can answer the question in ten minutes.
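The manual ping tests generalize to a binary search over probe sizes. A sketch in which a path_mtu parameter stands in for the real network (in practice each probe would be a DF-flagged ping of that size; this is a hypothetical diagnostic, not the tool we actually built):

```python
def probe_path_mtu(lo: int, hi: int, path_mtu: int) -> int:
    """Binary-search the largest packet size a path passes.

    `path_mtu` simulates the network; to use this against a live path,
    replace the comparison with a real DF-flagged probe of size `mid`.
    """
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid <= path_mtu:  # probe of size `mid` got an echo reply
            lo = mid
        else:                # probe was black-holed
            hi = mid - 1
    return lo

print(probe_path_mtu(576, 1500, 1400))  # -> 1400
```

About ten probes pin down the path MTU to the byte, which is why the guessing phase of this incident was so avoidable.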

The Quote

"TCP connected fine. HTTP worked fine. But a 87-byte fragment carrying the tail end of a TLS ClientHello vanished into a firewall rule that someone set and forgot three years ago."
