Network Automation Footguns
Mistakes that brick devices, cause outages, or make your automation worse than manual operations.
1. Running Automation Without a Diff Step
You push a config change script to all 200 switches without previewing what will change. On 12 switches, the "change" overwrites a locally-configured static route that was keeping them reachable. Those switches are now unreachable.
Fix: Always use compare_config() (NAPALM) or generate a diff before committing. For Nornir jobs, run against a sample of devices first and inspect the output before running against all. Never commit_config() without reviewing the diff.
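One way to make the review step hard to skip is to route every commit through a small gate. The sketch below assumes `device` is an already opened NAPALM driver; `commit_with_review` and the `approve` callback are hypothetical names introduced here for illustration, not part of NAPALM.

```python
def commit_with_review(device, config_text, approve):
    """Stage a merge, pass the diff to `approve`, and commit only if it agrees.

    `device` is assumed to be an open NAPALM driver. `approve` is a
    hypothetical callable that receives the diff text and returns True/False.
    """
    device.load_merge_candidate(config=config_text)
    diff = device.compare_config()

    # An empty diff means the device already matches; nothing to commit.
    if not diff.strip():
        device.discard_config()
        return "no-change"

    if approve(diff):
        device.commit_config()
        return "committed"

    # Reviewer rejected the diff: drop the candidate and leave the device alone.
    device.discard_config()
    return "discarded"
```

In an interactive run, `approve` could print the diff and ask for confirmation; in a pipeline it could apply automated checks instead.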
2. Using load_replace_candidate When You Mean load_merge_candidate
load_replace_candidate replaces the entire running config with your file. If your file is missing NTP, SNMP, logging, AAA, or any management-plane config, those configurations are deleted on commit. Devices can become unreachable if their management access config gets wiped.
Fix: Use load_merge_candidate unless you explicitly intend a full config replace (and have a complete, verified config file). For replace operations, always do a pre-commit review of the diff to check for unexpected deletions (lines starting with -).
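The check for unexpected deletions can be partly automated. The sketch below scans a diff for removed lines whose first word looks like management-plane config. It assumes the diff marks removals with a leading `-`, which varies by platform; `risky_deletions` and `PROTECTED_PREFIXES` are hypothetical names, and the prefix list is only a starting point.

```python
# Illustrative list only; extend it for your own management-plane config.
PROTECTED_PREFIXES = ("ntp", "snmp", "logging", "aaa")


def risky_deletions(diff, prefixes=PROTECTED_PREFIXES):
    """Return removed diff lines whose first word starts with a protected prefix."""
    risky = []
    for line in diff.splitlines():
        # Only removed lines matter; skip unified-diff headers like "--- running".
        if not line.startswith("-") or line.startswith("---"):
            continue
        tokens = line[1:].split()
        if tokens and tokens[0].lower().startswith(tuple(prefixes)):
            risky.append(line)
    return risky
```

A replace job could refuse to commit whenever this returns a non-empty list, leaving the final decision to a human.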
3. Not Handling Nornir Task Failures
By default, Nornir does not raise an exception when a task fails on individual hosts. Your script prints "All done!" while half your fleet is in a broken state. The errors are stored in the result objects, where nothing surfaces them unless you look.
# Dangerous: appears to succeed
nr.run(task=push_configs)
print("Done")
# Correct: always check
results = nr.run(task=push_configs)
failed = [h for h, r in results.items() if r.failed]
if failed:
    raise RuntimeError(f"Failed on {len(failed)} hosts: {failed}")
Fix: Always iterate results and check .failed after every Nornir run. Consider wrapping this in a utility function so it's impossible to call nr.run() without failure checking.
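A minimal sketch of such a utility function follows. `run_checked` is a hypothetical name; it assumes `nr` is an initialized Nornir object and relies only on the `.items()` and `.failed` behaviour already used above.

```python
def run_checked(nr, task, **kwargs):
    """Run a Nornir task and raise if any host failed.

    `nr` is assumed to be an initialized Nornir object; extra keyword
    arguments are passed straight through to nr.run().
    """
    results = nr.run(task=task, **kwargs)
    failed = sorted(host for host, result in results.items() if result.failed)
    if failed:
        raise RuntimeError(f"Failed on {len(failed)} hosts: {failed}")
    return results
```

Calling `run_checked(nr, push_configs)` everywhere instead of `nr.run(...)` turns a silent partial failure into a loud one.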
4. Hardcoding Credentials in Automation Scripts
A colleague commits netops_password = "SuperS3cret123" to the automation repo. The password is now in git history forever. When the password changes, someone updates the script — and commits it again.
Fix: Load credentials from environment variables (os.environ), a vault (HashiCorp Vault, AWS Secrets Manager), or a .env file that is gitignored. Use Nornir's SimpleInventory with password loaded from environment rather than from hosts.yaml.
import os
from nornir import InitNornir
nr = InitNornir(config_file="config.yaml")
# Inject credentials at runtime — not in hosts.yaml
nr.inventory.defaults.username = os.environ["NET_USERNAME"]
nr.inventory.defaults.password = os.environ["NET_PASSWORD"]
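Reading `os.environ[...]` directly fails with a bare `KeyError` when a variable is missing. A small loader can report every missing variable at once before any device is touched. This is a sketch; `load_credentials` is a hypothetical helper, and the variable names are the ones used above.

```python
import os


def load_credentials(env=os.environ):
    """Return (username, password) from the environment, or fail with a clear error."""
    required = ("NET_USERNAME", "NET_PASSWORD")
    # Treat empty strings as missing so a blank export does not slip through.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["NET_USERNAME"], env["NET_PASSWORD"]
```

The error message names the variables but never echoes their values, so it is safe to log.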
5. Treating Automation Output as Ground Truth Without Verification
device.get_interfaces() says all interfaces are up. You report success. In reality, NAPALM returned cached or stale data because the device's SSH session was degraded, and one critical interface is actually down.
Fix: Cross-validate critical state with a second method. For example, after a change, verify with both NAPALM getters and a raw show command. For production changes, use a post-change sleep + re-poll rather than relying on the first read after commit.
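The sleep and re-poll step might look like the sketch below. It assumes `get_interfaces` is a callable returning a dict shaped like NAPALM's `get_interfaces()` output (interface name mapped to a dict with an `is_up` key); `wait_until_up` and its default values are hypothetical.

```python
import time


def wait_until_up(get_interfaces, required, attempts=3, delay=10.0):
    """Poll interface state after a change; return the interfaces still down.

    Sleeps before every poll so the first read is never taken
    immediately after the commit. An empty list means all are up.
    """
    down = list(required)
    for _ in range(attempts):
        time.sleep(delay)
        state = get_interfaces()
        # A missing interface counts as down rather than being ignored.
        down = [name for name in required
                if not state.get(name, {}).get("is_up", False)]
        if not down:
            return []
    return down
```

For example, `wait_until_up(device.get_interfaces, ["GigabitEthernet0/1"])` returns an empty list on success; anything else is a reason to check with a raw show command before reporting success.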
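6. Ignoring SSH Host Key Verification
Skipping host key verification to make automation "easier". In Netmiko this is the default: ssh_strict=False accepts unknown host keys silently.
# Dangerous: the host key is never verified, so a MITM goes undetected
device = ConnectHandler(host="10.0.0.1", ..., ssh_strict=False)
# look_for_keys=False and allow_agent=False are often blamed for this, but they
# only disable key-based authentication and do not affect host key checks:
device = ConnectHandler(host="10.0.0.1", ..., look_for_keys=False,
                        allow_agent=False)
# or in NAPALM:
optional_args={"allow_agent": False, "look_for_keys": False}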
An attacker who can intercept the connection sees every command and all credentials. In networks where automation runs over the same OOB segment as user traffic, this is a real risk.
Fix: Use known-hosts verification in production. Pre-populate SSH known_hosts for all managed devices. For lab environments, accept the risk explicitly and do not copy the pattern to prod.
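For Netmiko, known-hosts verification is controlled by the `ssh_strict` and `system_host_keys` connection arguments. The sketch below only builds the argument dict; `strict_netmiko_params` is a hypothetical helper, and the `cisco_ios` device type and environment variable names are assumptions.

```python
import os


def strict_netmiko_params(host, device_type="cisco_ios"):
    """Build Netmiko connection arguments that enforce host key verification."""
    return {
        "device_type": device_type,
        "host": host,
        "username": os.environ.get("NET_USERNAME", ""),
        "password": os.environ.get("NET_PASSWORD", ""),
        "ssh_strict": True,        # reject hosts whose key is not already known
        "system_host_keys": True,  # load keys from the user's known_hosts file
    }


# Usage, with Netmiko installed and known_hosts pre-populated:
#   from netmiko import ConnectHandler
#   conn = ConnectHandler(**strict_netmiko_params("10.0.0.1"))
```

With these settings a device whose key is missing from known_hosts is rejected instead of silently trusted, which is why pre-populating known_hosts has to come first.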
7. Using send_command Timeout Too Short on Slow Devices
Older IOS devices running show ip bgp on a large table can take 30+ seconds. The default Netmiko timeout is 10 seconds. The command times out, but Netmiko may have already sent the command — which is still running on the device. Your next command's output gets mixed with the previous command's output.
Fix: Set explicit timeouts based on expected command latency:
output = conn.send_command(
    "show ip bgp",
    read_timeout=120,    # wait up to 2 minutes
    expect_string=r"#",  # match prompt to know command finished
)
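To keep timeouts out of individual call sites, one option is a lookup table of commands known to be slow. The sketch below is illustrative only: `SLOW_COMMANDS`, `read_timeout_for`, and every value in the table are assumptions that should be replaced with latencies measured on your own devices.

```python
# Illustrative values only; measure real command latency on your platforms.
SLOW_COMMANDS = {
    "show ip bgp": 120,
    "show running-config": 60,
    "show tech-support": 600,
}
DEFAULT_READ_TIMEOUT = 10  # seconds


def read_timeout_for(command):
    """Return a read timeout in seconds based on the command's prefix."""
    normalized = " ".join(command.lower().split())
    for prefix, timeout in SLOW_COMMANDS.items():
        if normalized.startswith(prefix):
            return timeout
    return DEFAULT_READ_TIMEOUT
```

A call site would then pass `read_timeout=read_timeout_for(cmd)` to `send_command`.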
8. Not Using expect_string for Commands That Change the Prompt
Some commands drop into interactive sub-shells or change the prompt (e.g., debug, more, configuration mode initiated incorrectly). send_command waits for the default prompt pattern, never sees it, and stalls until the read timeout expires.
Fix: For any command that could change the prompt or require interaction, always set expect_string:
output = conn.send_command("debug ip bgp", expect_string=r"#|--More--|confirm")
# Handle the 'confirm' case:
if "confirm" in output:
    conn.send_command("yes", expect_string=r"#")
9. Assuming NETCONF Is Enabled When SSH Is Available
SSH being available on port 22 does not mean NETCONF is enabled on port 830. On Cisco IOS, NETCONF has to be enabled explicitly in the device configuration.
If NETCONF is not enabled, ncclient raises a connection error that may be confusing (looks like a firewall block rather than a config issue).
Fix: Check NETCONF capability before building automation that depends on it:
nmap -p 830 10.0.0.1 # Is port open?
ssh -p 830 -s netops@10.0.0.1 netconf # Does it respond with NETCONF hello?
On the device: show netconf-yang status (IOS-XE).
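The port check can also run from inside the automation itself, as a preflight before ncclient is involved. This sketch uses only the standard library; `tcp_port_open` is a hypothetical helper. It confirms only that the TCP port accepts connections, not that a NETCONF hello follows.

```python
import socket


def tcp_port_open(host, port=830, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A job that depends on NETCONF can skip or flag hosts where this returns False, with an error message that points at device config rather than at the firewall.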
10. Modifying the Inventory File While a Job Is Running
You edit hosts.yaml to add a new switch while a Nornir job is running against the existing inventory. Depending on when the job reads the file, it may load a partially written inventory and fail with a YAML parse error, or silently miss devices.
Fix: Treat inventory files as read-only during job execution. Use file locks or inventory versioning. For dynamic inventories (Netbox, etc.), ensure the API is read-consistent.
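Partial reads can also be avoided on the writer's side by never editing the inventory in place. The sketch below writes the new content to a temporary file in the same directory and then swaps it in with `os.replace`, so a reader sees either the old file or the new one. `write_inventory_atomically` is a hypothetical helper, not part of Nornir or Ansible.

```python
import os
import tempfile


def write_inventory_atomically(path, text):
    """Replace the file at `path` with `text` without exposing a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for os.replace to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

This protects against half-written files only. A job that has already loaded the old inventory still runs against the old device list, which is where versioning helps.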
11. Using Ansible ios_command Instead of Resource Modules for Config
ios_command sends whatever commands you list, unconditionally. It does not compare them against the device's current state, so it is not idempotent: a second run sends the same commands again (or errors), and there is no diff to review before anything is applied.
# Non-idempotent: runs the command unconditionally
- name: Set NTP
  ios_command:
    commands:
      - ntp server 10.0.0.1
# Idempotent: checks current state, applies only if needed
- name: Set NTP
  cisco.ios.ios_ntp_global:
    config:
      servers:
        - server: 10.0.0.1
          prefer: true
    state: merged
Fix: Use Ansible network resource modules (ios_bgp_global, eos_interfaces, nxos_vlans, etc.) for configuration tasks. Reserve ios_command for read-only show commands and verification.
12. Forgetting That TextFSM Templates Are Version-Specific
NTC templates parse CLI output by matching specific strings. A vendor OS upgrade changes the output format and the template no longer parses correctly — it returns empty data instead of raising an error.
Fix: After every OS upgrade on network devices, re-test your automation parsers. Pin the ntc-templates version in requirements.txt so an ntc-templates update doesn't silently break parsing. Always check for empty results:
result = conn.send_command("show ip interface brief", use_textfsm=True)
if not isinstance(result, list) or len(result) == 0:
    raise ValueError(f"TextFSM parsing returned empty on {conn.host}: check template")
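An empty result is not the only failure mode: a template update can also rename fields, so rows come back populated but without the keys your code reads. The sketch below extends the check to required keys. `validate_parsed` is a hypothetical helper, and the key names in the usage line are assumptions that depend on the pinned ntc-templates version.

```python
def validate_parsed(result, required_keys, host="unknown"):
    """Raise if TextFSM output is empty, unparsed, or missing expected fields."""
    # Unparsed output typically comes back as a raw string instead of a list.
    if not isinstance(result, list) or not result:
        raise ValueError(f"TextFSM parsing returned no rows on {host}")
    for index, row in enumerate(result):
        missing = set(required_keys) - set(row)
        if missing:
            raise ValueError(
                f"Row {index} on {host} is missing fields: {sorted(missing)}"
            )
    return result
```

Usage might look like `rows = validate_parsed(result, {"status", "proto"}, host=conn.host)`, with the key set adjusted to whatever the pinned template actually emits.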
13. Running Automation Against Production Without a Dry-Run Path
Your script has no dry-run mode. The only way to test is to run it — against production.
Fix: Every automation script should support a --dry-run flag that prints what it would do without committing:
if args.dry_run:
    diff = device.compare_config()
    print(f"[DRY RUN] {device.hostname}:\n{diff}")
    device.discard_config()
else:
    device.commit_config()
Test dry-run in a lab against a canary device before any production run.
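The `args.dry_run` attribute used above has to come from somewhere. A minimal sketch of the flag with `argparse` follows; `build_parser` and the help text are hypothetical.

```python
import argparse


def build_parser():
    """Build a CLI parser exposing a --dry-run flag as args.dry_run."""
    parser = argparse.ArgumentParser(description="Push a config change")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="print the diff each device would receive, then discard it",
    )
    return parser


# Usage in a script's entry point:
#   args = build_parser().parse_args()
```

Some teams invert the default so that committing requires an explicit flag and a bare invocation is always a dry run. That is a design choice rather than something the text above prescribes.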