Investigation: SSH Timeout, MTU Mismatch, Fix Is Terraform Variable

Phase 1: Linux Ops Investigation (Dead End)

Try SSH with verbose output:

$ ssh -vvv ec2-user@10.0.12.45
...
debug1: Authentication succeeded (publickey).
debug1: channel 0: new [client-session]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: pledge: network
debug1: Sending environment.
debug1: Sending env LANG = en_US.UTF-8
# (hangs here — never gets shell prompt)

Authentication works. The hang is after the encrypted session is established, when the server tries to send the shell prompt and MOTD. This is a data transfer issue, not an auth issue.

Try with AWS Systems Manager Session Manager as a workaround:

$ aws ssm start-session --target i-0a1b2c3d4e5f67890
# (connects successfully)

sh-4.2$ cat /etc/ssh/sshd_config | grep -i dns
UseDNS no

sh-4.2$ cat /etc/profile
# No slow scripts

sh-4.2$ journalctl -u sshd --since "5 min ago"
Mar 19 15:30:12 ip-10-0-12-45 sshd[12847]: Accepted publickey for ec2-user from 10.0.1.5
Mar 19 15:30:12 ip-10-0-12-45 sshd[12847]: pam_unix(sshd:session): session opened for user ec2-user
# (no errors — session opened successfully on the server side)

The server sees the session as open. No PAM issues, no DNS delay, no profile script issues. The data simply is not getting through after the encrypted channel is established.

The Pivot

Small payloads work, large payloads hang. This is a classic MTU/PMTUD (Path MTU Discovery) problem. Test:

# From the VPN host, test with different packet sizes
$ ping -M do -s 1472 10.0.12.45   # 1472 + 28 header = 1500
PING 10.0.12.45 (10.0.12.45) 1472(1500) bytes of data.
# (hangs — no response)

$ ping -M do -s 1400 10.0.12.45   # 1400 + 28 = 1428
PING 10.0.12.45 (10.0.12.45) 1400(1428) bytes of data.
# (hangs — no response)

$ ping -M do -s 1372 10.0.12.45   # 1372 + 28 = 1400
1400 bytes from 10.0.12.45: icmp_seq=1 ttl=64 time=0.9 ms

Packets above ~1400 bytes are being dropped. The effective MTU on this path is around 1400, not 1500. Check the instance's MTU:

# Via SSM
sh-4.2$ ip link show ens5
2: ens5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc mq state UP mode DEFAULT group default qlen 1000

The instance has MTU 9001 (AWS jumbo frames). But the VPN tunnel has a lower MTU due to encapsulation overhead.
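The manual probing above generalizes to a binary search over the DF-bit ping. A minimal sketch (the `find_path_mtu` function and the `PING_CMD` override hook are ours, not standard tooling):

```shell
# Binary-search the path MTU between a known-good floor and a known-bad
# ceiling. 28 bytes = 20 (IP header) + 8 (ICMP header). PING_CMD can be
# overridden (e.g. stubbed out in tests); by default it is a DF-bit ping.
find_path_mtu() {
  host=$1; lo=${2:-1200}; hi=${3:-1501}   # invariant: lo passes, hi fails
  ping_cmd=${PING_CMD:-"ping -c 1 -W 1 -M do"}
  while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    if $ping_cmd -s $((mid - 28)) "$host" >/dev/null 2>&1; then
      lo=$mid    # this size fits: search upward
    else
      hi=$mid    # this size is dropped: search downward
    fi
  done
  echo "$lo"
}

# find_path_mtu 10.0.12.45   # on this path, converges to ~1400
```

Nine or ten probes cover the whole 1200-1500 range, versus guessing sizes by hand.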

Phase 2: Networking Investigation (Root Cause)

Check the network path:

VPN Client (MTU 1500) → VPN Tunnel (overhead: ~100 bytes, effective MTU: ~1400)
  → VPC (MTU 9001) → New subnet (MTU 9001) → Instance (MTU 9001)

The new subnet's route table goes through a VPN gateway. The VPN tunnel has encapsulation overhead, reducing the effective MTU to ~1400. The old subnet (10.0.4.0/24) has a different route that does not go through the VPN tunnel for internal traffic — it uses VPC-internal routing.
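The tunnel overhead is just the sum of the encapsulation layers. The exact figures depend on cipher, padding, and encapsulation mode; the numbers below assume IPsec tunnel mode with NAT traversal and are illustrative, not measured from this VPN:

```shell
# Illustrative IPsec (tunnel mode, NAT-T) per-packet overhead.
outer_ip=20     # new outer IPv4 header
nat_t_udp=8     # UDP encapsulation for NAT traversal
esp_header=8    # ESP SPI + sequence number
esp_iv=16       # initialization vector (cipher-dependent)
esp_trailer=2   # pad length + next header (plus 0-15 pad bytes)
esp_icv=16      # integrity check value
overhead=$((outer_ip + nat_t_udp + esp_header + esp_iv + esp_trailer + esp_icv))
echo "overhead: ${overhead} bytes"
echo "effective MTU: $((1500 - overhead)) bytes"
```

With worst-case padding and vendor extras, the practical figure lands closer to 100 bytes, which is why ~1400 is a common conservative effective MTU for VPN paths.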

Check why PMTUD is not working:

# The VPN gateway should send ICMP "Fragmentation Needed" (Type 3, Code 4)
# when it receives a packet too large for the tunnel

# Check the NACL on the new subnet
$ aws ec2 describe-network-acls --filters "Name=association.subnet-id,Values=subnet-0abc123def456" \
    --query 'NetworkAcls[].Entries[?RuleAction==`deny`]' --output table
| Egress | Protocol | RuleAction | RuleNumber | CidrBlock    |
| False  | -1       | deny       | 100        | 0.0.0.0/0    |

Wait, that is a blanket deny at rule 100. Check the full NACL:

$ aws ec2 describe-network-acls --filters "Name=association.subnet-id,Values=subnet-0abc123def456" \
    --query 'NetworkAcls[].Entries' --output table
| Egress | Protocol | PortRange | RuleAction | RuleNumber | CidrBlock     |
| False  | 6        | 22-22     | allow      | 10         | 10.0.0.0/8    |
| False  | 6        | 8080-8080 | allow      | 20         | 10.0.0.0/8    |
| False  | 1        | N/A       | allow      | 30         | 10.0.0.0/8    |
| False  | -1       | N/A       | deny       | 100        | 0.0.0.0/0     |
| True   | 6        | 1024-65535| allow      | 10         | 10.0.0.0/8    |
| True   | -1       | N/A       | deny       | 100        | 0.0.0.0/0     |

The NACL allows ICMP (rule 30 inbound), but the rule's ICMP type/code field (not shown in this table view) restricts it to type 8 (echo request). ICMP type 3 (Destination Unreachable, which includes code 4, "Fragmentation Needed") is not allowed. The VPN gateway sends ICMP type 3 code 4 to tell the sender to reduce packet size, but the NACL drops it. PMTUD is broken.

The old subnet uses the default NACL which allows all traffic. The new subnet was created with a restrictive NACL via Terraform, and the Terraform module does not include ICMP type 3 in the allow rules.
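A sketch of the missing rule in HCL (the resource names and rule number are hypothetical; the real module's layout will differ):

```hcl
# Allow ICMP Destination Unreachable (type 3) so PMTUD works.
# Code 4 is "Fragmentation Needed"; icmp_code = -1 covers all codes of type 3.
resource "aws_network_acl_rule" "allow_icmp_unreachable" {
  network_acl_id = aws_network_acl.private.id   # hypothetical resource name
  rule_number    = 25                           # must sort before the deny at 100
  egress         = false
  protocol       = "icmp"
  rule_action    = "allow"
  cidr_block     = "10.0.0.0/8"
  icmp_type      = 3
  icmp_code      = -1
}
```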

Domain Bridge: Why This Crossed Domains

Key insight: The symptom was SSH hanging (linux_ops), the root cause was broken PMTUD due to ICMP type 3 being blocked by a NACL (networking), and the fix requires updating the Terraform module that manages subnet NACLs (cloud). This pattern is common because MTU issues manifest as application-layer hangs on large payloads; PMTUD depends on ICMP type 3, which is frequently blocked by overly restrictive firewall or NACL rules; and cloud networking adds encapsulation overhead that makes MTU mismatches more likely.

Root Cause

The new subnet's Network ACL, created by the Terraform VPC module, allows ICMP echo (ping) but blocks ICMP type 3 (Destination Unreachable / Fragmentation Needed). The VPN tunnel has a lower effective MTU than the VPC's 9001-byte jumbo frames. When SSH sends large packets (shell output, MOTD), the VPN gateway tries to signal the sender to reduce packet size via ICMP type 3 code 4, but the NACL drops these messages. PMTUD fails silently, and large packets are black-holed.
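Until the NACL fix ships, a stopgap on the instance side is to stop generating oversized frames at all, by lowering the interface MTU to the measured path MTU. A sketch (the 1400 figure comes from the ping probing above):

```shell
# 40 bytes = 20 (IPv4 header) + 20 (TCP header, no options);
# MSS is what a TCP endpoint may advertise at this MTU.
path_mtu=1400
mss=$((path_mtu - 40))
echo "clamp MTU to ${path_mtu}; advertised TCP MSS becomes ${mss}"
# The actual change (requires root; shown commented out):
# ip link set dev ens5 mtu "$path_mtu"
```

This trades jumbo-frame throughput on VPC-internal traffic for reachability over the tunnel, so it is a workaround, not the fix.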