Thinking Out Loud: SSH Deep Dive
A senior SRE's internal monologue while working through a real SSH issue. This isn't a tutorial — it's a window into how experienced engineers actually think.
The Situation
After a routine OS upgrade on a jump host, several team members report they can no longer SSH to production servers. The errors vary: "Host key verification failed," "Permission denied (publickey)," and "Connection refused." I still have access because my session was opened before the upgrade.
The Monologue
Three different SSH errors across team members after a jump host upgrade. Each error means something completely different. Let me categorize them.
"Connection refused" = sshd isn't running or listening on the expected port. "Host key verification failed" = the server's host key changed (the upgrade regenerated it). "Permission denied (publickey)" = authentication is failing.
These could all be caused by the same upgrade, but they're different symptoms. Let me tackle them in order of severity.
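To keep the three categories straight while triaging reports, the mapping above can be sketched as a small shell helper. The function name `classify_ssh_error` and its output labels are my own shorthand, not anything OpenSSH provides.

```shell
# classify_ssh_error MESSAGE: map an SSH client error line to a failure category.
# Hypothetical helper for triage notes; not part of OpenSSH.
classify_ssh_error() {
  case "$1" in
    *"Connection refused"*)
      echo "network: nothing accepted the TCP connection" ;;
    *"Host key verification failed"*)
      echo "host-key: server key differs from known_hosts" ;;
    *"Permission denied"*)
      echo "auth: server rejected the offered credentials" ;;
    *)
      echo "unknown: rerun with ssh -v" ;;
  esac
}

# Example: a report pasted from a teammate
classify_ssh_error "user@jumphost.internal: Permission denied (publickey)."
```

Anything that lands in the "unknown" bucket is a cue to ask for `ssh -v` output rather than guess.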
First: is sshd running on the jump host?
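A check along these lines would answer that, assuming the jump host uses systemd and has `ss` from iproute2. The helper name `listening_on_port` is hypothetical; it only parses `ss -tln` output fed to it on stdin.

```shell
# listening_on_port PORT: read `ss -tln` output on stdin and succeed if some
# TCP socket is in LISTEN state on PORT. Hypothetical helper name.
listening_on_port() {
  awk -v port="$1" '
    # Column 4 of `ss -tln` is the local address:port, e.g. 0.0.0.0:22 or [::]:22
    $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# On the jump host:
#   systemctl is-active sshd
#   ss -tln | listening_on_port 22 && echo "something is listening on 22"
```

Anchoring the match at the end of the address keeps port 2222 from being mistaken for port 22.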
sshd is running and listening on port 22. So "Connection refused" must be something else. Let me check — are those users trying to connect to the jump host or through it?
Ah, they're trying to connect THROUGH the jump host to production servers. The "Connection refused" might be the ProxyJump failing, which surfaces as "Connection refused" on the downstream connection.
Mental Model: SSH Error Location Matters
SSH error messages can be misleading about WHERE the failure occurs. "Connection refused" might mean the final target isn't reachable, or it might mean the ProxyJump/ProxyCommand failed and the error is from the jump host, not the target. Always determine whether the error is from the client -> jump host leg or the jump host -> target leg.
Let me test the two legs separately.
# Test client -> jump host
ssh -v jumphost.internal 'echo "jump host reached"'
# Test jump host -> production server
ssh jumphost.internal 'ssh -v prod-web-1.internal echo "prod reached"'
Client to jump host works for me (I have the right host key fingerprint in my known_hosts). Let me check what the users with "Host key verification failed" are seeing.
The OS upgrade regenerated the jump host's SSH host keys. Every user who has the old host key fingerprint in their ~/.ssh/known_hosts will see the MITM warning. They need to update their known_hosts.
# Users need to remove the old host key
ssh-keygen -R jumphost.internal
# Then reconnect (and accept the new key)
ssh jumphost.internal
But wait — before I tell everyone to just accept the new key, let me verify it's actually the new key from the legitimate server and not an actual MITM attack. I know this is paranoid, but the one time you skip this check...
# On the jump host's console (IPMI/serial), not over the network: print each host key fingerprint
for key in /etc/ssh/ssh_host_*_key.pub; do ssh-keygen -lf "$key"; done
I'll compare this fingerprint with what the users are seeing in their error messages. If they match, it's safe to accept.
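The comparison itself is easy to get wrong by eye, so here is a minimal sketch of doing it mechanically. Both helper names and the two input files in the usage comment are hypothetical; the only assumption about OpenSSH is that it prints fingerprints in the `SHA256:<base64>` form.

```shell
# first_sha256_fp: print the first SHA256 fingerprint found on stdin.
first_sha256_fp() {
  grep -o 'SHA256:[A-Za-z0-9+/]*' | head -n 1
}

# same_fingerprint A B: succeed only if both are non-empty and identical.
same_fingerprint() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage sketch (file names are placeholders):
#   console_fp=$(first_sha256_fp < console_fingerprint.txt)
#   user_fp=$(first_sha256_fp < user_error_message.txt)
#   same_fingerprint "$console_fp" "$user_fp" && echo "match: safe to accept"
```

The non-empty check matters: two failed extractions would otherwise compare as equal.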
Now, the "Permission denied (publickey)" issue. This is the most interesting one. The OS upgrade might have changed the SSH configuration.
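One way to check is to dump the effective server configuration with `sshd -T` and test whether a given algorithm is still accepted. This is a sketch under that assumption; `accepts_algorithm` is a hypothetical helper that only parses the dump it is given on stdin.

```shell
# accepts_algorithm NAME: read `sshd -T` output on stdin and succeed if NAME
# appears in the effective pubkeyacceptedalgorithms list. Hypothetical helper.
accepts_algorithm() {
  awk -v algo="$1" '
    tolower($1) == "pubkeyacceptedalgorithms" {
      n = split($2, list, ",")
      for (i = 1; i <= n; i++) if (list[i] == algo) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}

# On the jump host (sshd -T needs root):
#   sudo sshd -T | accepts_algorithm ssh-rsa || echo "ssh-rsa is NOT accepted"
# To see the raw directives as written in the config files:
#   grep -ri pubkey /etc/ssh/sshd_config /etc/ssh/sshd_config.d/ 2>/dev/null
```

Splitting on commas and comparing whole entries avoids false matches between similar names such as `ssh-rsa` and `rsa-sha2-256`.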
PubkeyAuthentication yes — that's correct. AuthorizedKeysFile .ssh/authorized_keys — standard. But wait, there's a new line: PubkeyAcceptedAlgorithms -ssh-rsa. The upgrade added a config that DISABLES the ssh-rsa algorithm.
Mental Model: SSH Algorithm Evolution
Modern SSH versions are deprecating older algorithms (ssh-rsa, which signs with SHA-1). OpenSSH 8.8+ disables ssh-rsa by default, and so does Ubuntu 22.04+. RSA keys themselves remain usable through the rsa-sha2-256 and rsa-sha2-512 signature algorithms, but a client that can only sign with ssh-rsa gets "Permission denied" from a server that no longer accepts it, even though the key is in authorized_keys. The fix is either to re-enable ssh-rsa (temporary) or move to Ed25519 keys (proper).
This is it. Users with old RSA keys (generated with ssh-keygen -t rsa) are being rejected because the server no longer accepts the ssh-rsa algorithm. Let me check which users are affected.
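# Check what key types are in authorized_keys (run as root to read other users' files)
for user_dir in /home/*; do
  username=$(basename "$user_dir")
  if [ -f "$user_dir/.ssh/authorized_keys" ]; then
    printf '%s:\n' "$username"
    # Count keys per type (first field), skipping blank lines and comments
    awk 'NF && $1 !~ /^#/ {print $1}' "$user_dir/.ssh/authorized_keys" | sort | uniq -c
  fi
done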
Most users have ssh-rsa keys. A few have ssh-ed25519. The Ed25519 users can connect fine. The RSA users are blocked.
I have two options:
1. Re-enable ssh-rsa temporarily so everyone can connect, then migrate to Ed25519
2. Tell everyone to generate new Ed25519 keys immediately
Option 1 is the right call during business hours. I can't tell 20 people to regenerate and distribute new keys right now while they're locked out of production.
# Temporarily re-enable ssh-rsa
sudo sed -i 's/PubkeyAcceptedAlgorithms -ssh-rsa/PubkeyAcceptedAlgorithms +ssh-rsa/' /etc/ssh/sshd_config
# Validate the edited config before reloading so a typo can't take sshd down
sudo sshd -t && sudo systemctl reload sshd
Now let me set a deadline for key migration. Users have 30 days to switch to Ed25519.
# What users should do:
ssh-keygen -t ed25519 -C "user@company.com"
ssh-copy-id -i ~/.ssh/id_ed25519.pub jumphost.internal
Let me also fix the "Connection refused" issue for users going through the jump host. Let me check if the ProxyJump/agent forwarding is working.
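A quick probe for this is to run `ssh-add -l` on the jump host over a connection with forwarding requested, then read the exit status. The sketch below assumes the documented `ssh-add` exit codes (0, 1, 2) and ssh's own 255 for connection errors; `describe_agent_status` is a hypothetical helper.

```shell
# describe_agent_status CODE: interpret the exit status of `ssh-add -l`
# when it is run on the jump host through `ssh -A`.
describe_agent_status() {
  case "$1" in
    0)   echo "agent reachable, identities loaded" ;;
    1)   echo "agent reachable, no identities" ;;
    2)   echo "no agent: forwarding is off or no agent is running" ;;
    255) echo "ssh itself failed before ssh-add ran" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# From a workstation with a running agent:
#   ssh -A jumphost.internal 'ssh-add -l'; describe_agent_status "$?"
```

Status 2 is the interesting one here: it means the jump host session has no agent socket at all, which points at the server-side forwarding setting rather than at the user's keys.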
"Could not open a connection to your authentication agent." Agent forwarding might be disabled in the new sshd config.
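AllowAgentForwarding no. The upgrade disabled agent forwarding. Users who hop through the jump host with a forwarded agent (ssh -A, then ssh from the jump host) can't reach the production servers because their key isn't forwarded. A pure ProxyJump (-J) connection authenticates from the client and relies on TCP forwarding instead, so it is worth checking which pattern each affected user actually uses.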
sudo sed -i 's/AllowAgentForwarding no/AllowAgentForwarding yes/' /etc/ssh/sshd_config
sudo sshd -t && sudo systemctl reload sshd
Let me verify all three issues are fixed.
# Test full path: client -> jump host -> prod server
ssh -J jumphost.internal prod-web-1.internal 'hostname'
Works. Let me have someone from the team try too, to verify the host key and authentication fixes work for them.
Now, for the long-term fixes:
1. Pin the SSH configuration in our Ansible playbook so OS upgrades don't change security settings
2. Migrate all users to Ed25519 keys within 30 days
3. Add the jump host's new host key fingerprint to our team wiki
4. Set up a pre-upgrade checklist that includes SSH config verification
# Add to the Ansible SSH role:
# - sshd_config is templated and controlled by Ansible
# - OS package upgrades won't override our config
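For the pre-upgrade checklist, one possible shape is to snapshot the effective sshd settings before the upgrade and diff them afterwards. This is a sketch, not our actual tooling: both function names and the snapshot paths are hypothetical, and `snapshot_sshd_config` assumes root access on the jump host.

```shell
# snapshot_sshd_config FILE: save the effective sshd settings, sorted so that
# two snapshots can be diffed line by line. Needs root on the jump host.
snapshot_sshd_config() {
  sudo sshd -T | sort > "$1"
}

# sshd_config_drift BEFORE AFTER: print settings that changed between two
# snapshots. Exits non-zero when they differ (standard diff behavior).
sshd_config_drift() {
  diff "$1" "$2"
}

# Usage sketch (paths are placeholders):
#   snapshot_sshd_config /root/sshd-before.txt
#   ...run the OS upgrade...
#   snapshot_sshd_config /root/sshd-after.txt
#   sshd_config_drift /root/sshd-before.txt /root/sshd-after.txt || echo "review changes"
```

Diffing the `sshd -T` dump rather than the config files catches changed compiled-in defaults as well as edited directives, which is the failure mode this incident showed.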
What Made This Senior-Level
| Junior Would... | Senior Does... | Why |
|---|---|---|
| Tell users to just remove the known_hosts entry and reconnect | Verify the new host key fingerprint before telling users to accept it | The one time you skip fingerprint verification is the one time there's an actual MITM |
| Not understand why "Permission denied" appeared after an upgrade | Know that modern SSH versions deprecate ssh-rsa and check the PubkeyAcceptedAlgorithms config | Algorithm deprecation is the #1 cause of post-upgrade SSH authentication failures |
| Re-enable the old algorithm and forget about it | Re-enable temporarily AND set a 30-day deadline for key migration | The temporary fix buys time; the migration fixes the underlying security improvement |
| Not connect the upgrade to the agent forwarding issue | Check all SSH-related configuration changes the upgrade might have introduced | OS upgrades can change multiple SSH settings, not just the one that produced the first error |
Key Heuristics Used
- Categorize SSH Errors: "Connection refused," "Host key verification failed," and "Permission denied" are three completely different problems. Don't conflate them.
- Algorithm Deprecation Awareness: Modern OpenSSH disables ssh-rsa. Check PubkeyAcceptedAlgorithms when authentication fails after upgrades.
- Fix Now, Migrate Later: Re-enable deprecated algorithms as a temporary fix to unblock users, then set a firm deadline for migrating to modern key types.
Cross-References
- Primer — SSH architecture, key types, authentication flow, and agent forwarding
- Street Ops — SSH debugging with -v flags, key management commands, and config troubleshooting
- Footguns — OS upgrades changing SSH config, accepting host keys without verification, and relying on deprecated algorithms