Investigation: Ansible Playbook Hangs, SSH Agent Forwarding Broken, Root Cause Is Firewall Rule
Phase 1: DevOps Tooling Investigation (Dead End)
Check the Ansible task:
# devops/ansible/playbooks/rolling-update.yml (relevant task)
- name: Pull updated config from git repo
  git:
    repo: "git@gitlab.internal:infra/configs.git"
    dest: /opt/configs
    version: main
    accept_hostkey: yes
  become: yes
  become_user: deploy
The task uses the git module with SSH transport (git@gitlab.internal). It relies on SSH agent forwarding: the key loaded into the agent on the Ansible control node is forwarded through the SSH connection so that the deploy user on the target can authenticate to GitLab.
Check the Ansible SSH configuration:
# ansible.cfg
[ssh_connection]
ssh_args = -o ForwardAgent=yes -o ControlMaster=auto -o ControlPersist=60s
Agent forwarding is configured. Test it:
# From the Ansible control node
$ ssh -A deploy@app-server-03 "ssh-add -l"
Could not open a connection to your authentication agent.
SSH agent forwarding is not working on app-server-03. Compare with a working server:
$ ssh -A deploy@app-server-01 "ssh-add -l"
4096 SHA256:abc123... /home/ansible/.ssh/id_ed25519 (ED25519)
Agent forwarding works on app-server-01 but not on app-server-03. The agent socket is not being created on app-server-03.
The Pivot
Check the SSH daemon configuration on app-server-03:
$ sudo sshd -T | grep -i allowagentforwarding
allowagentforwarding yes
Agent forwarding is allowed in sshd_config. Check whether the agent socket is created:
$ ls -la /tmp/ssh-*/
ls: cannot access '/tmp/ssh-*/': No such file or directory
$ echo $SSH_AUTH_SOCK
# (empty)
The agent socket is not created at all. Check the SSH connection with verbose output:
# From the control node
$ ssh -vvv -A deploy@app-server-03 2>&1 | grep -i "agent\|forward"
debug1: Requesting authentication agent forwarding.
debug2: channel 0: request auth-agent-req@openssh.com confirm 0
debug2: channel 1: open confirm rwindow 0 rmax 32768
debug2: channel 1: rcvd adjust 2097152
The agent forwarding is requested and the channel is opened. But on the server side:
# On app-server-03
$ journalctl -u sshd --since "5 min ago" | grep -i agent
Mar 19 16:30:12 app-server-03 sshd[12847]: debug1: agent forwarding: socket bound to /tmp/ssh-xxxx/agent.12847
Mar 19 16:30:12 app-server-03 sshd[12847]: debug1: agent forwarding: channel opened (fd 7)
The socket IS created by sshd. Can the deploy user actually access it? Check:
$ ssh -A deploy@app-server-03 "ls -la /tmp/ssh-*/"
total 0
srwxr-xr-x 1 deploy deploy 0 Mar 19 16:35 agent.28471
$ ssh -A deploy@app-server-03 "SSH_AUTH_SOCK=/tmp/ssh-xxxx/agent.28471 ssh-add -l"
4096 SHA256:abc123... /home/ansible/.ssh/id_ed25519 (ED25519)
The socket exists and the key is accessible! But $SSH_AUTH_SOCK is not set in the user's environment when become: yes is used. The issue is become_user: deploy — when Ansible uses sudo to become the deploy user, the SSH_AUTH_SOCK environment variable is stripped by sudo's env_reset.
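The stripping can be reproduced without sudo access. `env -i` launches a child process with an empty environment, which is roughly what sudo's env_reset does to any variable not listed in env_keep (the socket path below is made up for the demo):

```shell
# Hypothetical demo: env -i mimics sudo's env_reset by starting a child
# with an empty environment, dropping SSH_AUTH_SOCK along the way.
export SSH_AUTH_SOCK=/tmp/ssh-demo/agent.1234
sh -c 'echo "with env:  ${SSH_AUTH_SOCK:-unset}"'
env -i sh -c 'echo "env_reset: ${SSH_AUTH_SOCK:-unset}"'
```

The second command prints `unset`: the child never sees the agent socket, exactly what the git task experiences under become.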
But this should also happen on the other servers... unless they have a different sudoers configuration:
# On app-server-01 (working)
$ sudo grep "SSH_AUTH_SOCK\|env_keep" /etc/sudoers /etc/sudoers.d/*
/etc/sudoers.d/ansible: Defaults env_keep += "SSH_AUTH_SOCK"
# On app-server-03 (broken)
$ sudo grep "SSH_AUTH_SOCK\|env_keep" /etc/sudoers /etc/sudoers.d/*
# (no matches)
The old servers have a sudoers drop-in file that preserves SSH_AUTH_SOCK. The new server does not — it was missed during provisioning.
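The fix for app-server-03 is to restore the same drop-in the older servers carry (path taken from the app-server-01 output above); validating with `visudo -cf` before installing avoids locking out sudo with a syntax error:

```shell
# /etc/sudoers.d/ansible: preserve the forwarded agent socket across sudo
Defaults env_keep += "SSH_AUTH_SOCK"
```

Syntax-check with `visudo -cf /etc/sudoers.d/ansible` and keep the file at mode 0440, as sudo requires.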
But the Ansible playbook still hangs — it does not fail, it hangs. Why?
Phase 2: Linux Ops Investigation (Root Cause)
When the git clone command runs without an SSH agent, SSH cannot use the forwarded key and falls back to its remaining authentication methods, ending in a password prompt. Since Ansible runs non-interactively with no terminal attached, that prompt would block forever. That alone could explain the hang, but it is worth confirming exactly where SSH stalls:
$ ssh -A deploy@app-server-03
$ GIT_SSH_COMMAND="ssh -v" git clone git@gitlab.internal:infra/configs.git 2>&1 | head -20
OpenSSH_9.2p1, OpenSSL 3.0.13 30 Jan 2024
debug1: Connecting to gitlab.internal [10.0.5.100] port 22 ...
# (hangs here)
SSH to GitLab hangs at the connection stage. Not even reaching authentication. Test:
$ ssh -v git@gitlab.internal 2>&1 | head -10
OpenSSH_9.2p1, OpenSSL 3.0.13 30 Jan 2024
debug1: Connecting to gitlab.internal [10.0.5.100] port 22 ...
# (hangs)
$ nc -zv -w 5 gitlab.internal 22
nc: connect to gitlab.internal (10.0.5.100) port 22 (tcp) failed: Connection timed out
TCP to GitLab on port 22 times out from app-server-03. But from app-server-01:
$ nc -zv -w 5 gitlab.internal 22
Connection to gitlab.internal (10.0.5.100) 22 port [tcp/ssh] succeeded!
app-server-03 cannot reach GitLab on port 22. Check the firewall rules:
# On app-server-03
$ sudo iptables -L -n | grep 22
# (no output — no local iptables rules)
# Check the cloud/network firewall
$ aws ec2 describe-security-groups --group-ids sg-0abc123 \
    --query 'SecurityGroups[].IpPermissionsEgress[?ToPort==`22`]' --output table
| FromPort | ToPort | IpProtocol | CidrIp |
| 22 | 22 | tcp | 10.0.1.0/24 |
The security group only allows outbound SSH to 10.0.1.0/24 (the management subnet). GitLab is on 10.0.5.0/24 (the tools subnet). The old servers are in a different security group that allows SSH to all internal subnets. The new server was provisioned with a more restrictive security group that does not allow SSH to the tools subnet.
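Assuming the intent is to keep this server in sg-0abc123 rather than move it to the old servers' group, a single egress rule for the tools subnet would unblock git. A sketch of the change (verify the group id and CIDR against your actual VPC layout first):

```shell
# Allow outbound SSH from app-server-03's security group to the tools subnet
aws ec2 authorize-security-group-egress \
  --group-id sg-0abc123 \
  --ip-permissions 'IpProtocol=tcp,FromPort=22,ToPort=22,IpRanges=[{CidrIp=10.0.5.0/24}]'
```

The cleaner long-term fix is to correct the provisioning template so new servers get the intended security group in the first place.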
Domain Bridge: Why This Crossed Domains
Key insight: the symptom was an Ansible playbook hanging (devops_tooling), the initial investigation pointed to SSH agent forwarding and sudo environment handling (linux_ops), but the actual root cause was a firewall rule blocking outbound SSH to the GitLab server (networking). This pattern is common because Ansible playbooks chain multiple SSH hops and connections: agent forwarding issues, sudo environment problems, and network firewall rules can all produce the same symptom, a hanging task. The failure mode is a silent hang (TCP timeout) rather than an error message, which makes diagnosis harder.
Root Cause
The newly provisioned app-server-03 was assigned a restrictive security group that only allows outbound SSH to the management subnet (10.0.1.0/24), not to the tools subnet (10.0.5.0/24) where GitLab lives. When the Ansible playbook ran git clone via SSH, the TCP connection to GitLab timed out silently. The investigation was complicated by a secondary issue — the sudoers configuration was also missing the SSH_AUTH_SOCK preservation — but the primary block was the firewall rule.
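Beyond the firewall fix, the playbook itself can be hardened so either failure mode (password prompt, silent TCP drop) surfaces as a fast error instead of a hang. A sketch using the git module's ssh_opts parameter, assuming nothing beyond the task shown in Phase 1:

```yaml
# Sketch: make the git task fail fast rather than hang.
- name: Pull updated config from git repo
  git:
    repo: "git@gitlab.internal:infra/configs.git"
    dest: /opt/configs
    version: main
    accept_hostkey: yes
    # BatchMode=yes: ssh errors out instead of prompting for a password.
    # ConnectTimeout=10: a silently dropped TCP connection fails in 10s.
    ssh_opts: "-o BatchMode=yes -o ConnectTimeout=10"
  become: yes
  become_user: deploy
```

With these options, the original incident would have produced a visible "Connection timed out" failure within seconds instead of an indefinite hang.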