Python for Infrastructure Footguns
Mistakes that turn your automation into a liability.
1. Hardcoding AWS credentials in the script
You put aws_access_key_id and aws_secret_access_key directly in your Python file. The script gets committed to Git. The credentials are now in history forever. Someone runs trufflehog on your repo and finds them. AWS sends you a bill for 200 GPU instances mining crypto.
Fix: Use environment variables, AWS profiles, or IAM roles. boto3 automatically checks ~/.aws/credentials, environment variables, and instance metadata. Never put credentials in code.
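A minimal sketch of the environment-variable approach; the function name and error message are illustrative, not a real API:

```python
import os

def get_aws_credentials():
    """Read credentials from the environment instead of source code."""
    key_id = os.environ.get("AWS_ACCESS_KEY_ID")
    secret = os.environ.get("AWS_SECRET_ACCESS_KEY")
    if not key_id or not secret:
        raise RuntimeError(
            "AWS credentials not found; set AWS_ACCESS_KEY_ID / "
            "AWS_SECRET_ACCESS_KEY, or use an AWS profile or IAM role"
        )
    return key_id, secret

# With boto3 you normally skip even this: boto3.client("ec2") with no
# arguments walks the credential chain (env vars, ~/.aws/credentials,
# instance metadata) for you.
```

In practice, prefer the no-arguments boto3 client and let the credential chain do the work; explicit lookup like this is only for non-AWS tooling that needs the values directly.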
2. No timeout on HTTP requests
Your script calls an internal API with requests.get(url). The server is overloaded and does not respond. Your script hangs forever. Overlapping cron runs pile up. You now have 30 zombie Python processes consuming memory and holding open connections.
Fix: Always set timeout=(connect_timeout, read_timeout) on every request. requests.get(url, timeout=(5, 30)). Use a session with retry configuration for robustness.
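One way to sketch the session-with-retries setup, assuming requests (with its bundled urllib3) is installed; the retry counts and status codes here are illustrative defaults:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(retries: int = 3, backoff: float = 0.5) -> requests.Session:
    """Build a Session that retries transient failures with backoff."""
    retry = Retry(
        total=retries,
        backoff_factor=backoff,
        status_forcelist=(429, 500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

# Retries do not replace timeouts -- every call still needs one:
# resp = make_session().get(url, timeout=(5, 30))  # (connect, read) seconds
```

The timeout must be passed per call; a Session has no global timeout setting, which is exactly why it is so easy to forget.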
3. Using subprocess with shell=True and variables
You construct a command with f-strings and pass it to subprocess.run(cmd, shell=True). A server name contains a space or semicolon. Your script either fails silently or executes arbitrary commands. This is a shell injection vulnerability.
Fix: Pass arguments as a list: subprocess.run(["ping", "-c", "1", hostname]). Never interpolate user-controlled values into shell strings. If you need pipes, do the piping in Python.
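A sketch of the list-argument pattern; run_safe is a hypothetical helper name:

```python
import subprocess

def run_safe(args: list[str], timeout: int = 30) -> subprocess.CompletedProcess:
    """Run a command with arguments as a list -- no shell, no injection."""
    return subprocess.run(args, capture_output=True, text=True, timeout=timeout)

# A hostile hostname is just an inert argument here, never shell syntax:
# run_safe(["ping", "-c", "1", hostname])
```

Even a value like "host1; rm -rf /" is delivered to the program as a single literal argument, because no shell ever parses the command line.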
4. Not paginating AWS API calls
You call ec2.describe_instances() and process the results. Your script works in dev (10 instances). In production (2,000 instances), it only processes the first 1,000 and silently ignores the rest. Half your fleet is invisible to your automation.
Fix: Use paginators for every AWS list operation. paginator = client.get_paginator('describe_instances') then iterate over paginator.paginate().
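A sketch of the paginator pattern; list_all_instance_ids is a hypothetical helper, and the client is passed in as a parameter so the function is easy to test without AWS access:

```python
def list_all_instance_ids(ec2_client) -> list:
    """Collect every instance ID, however many pages the API returns.

    A single describe_instances response is capped; the paginator
    follows NextToken across pages so nothing is silently dropped.
    """
    paginator = ec2_client.get_paginator("describe_instances")
    ids = []
    for page in paginator.paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                ids.append(instance["InstanceId"])
    return ids

# Usage (assumes boto3 is installed and credentials are configured):
# import boto3
# ids = list_all_instance_ids(boto3.client("ec2"))
```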
5. Writing to files non-atomically
Your script writes a new config file with open(path, 'w').write(content). The process gets killed mid-write. The config file is now half-written. The service reads the partial file and crashes.
Fix: Write to a temporary file in the same directory, then rename. A rename on the same filesystem is atomic on Linux. Use tempfile.mkstemp() and os.replace().
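A minimal atomic-write sketch: the temp file is created in the target's own directory so the final os.replace() stays on one filesystem, where it is atomic on POSIX.

```python
import os
import tempfile

def atomic_write(path: str, content: str) -> None:
    """Write content to path atomically: temp file, fsync, then rename."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())  # data is on disk before the rename
        os.replace(tmp_path, path)  # atomic; readers see old or new, never half
    except BaseException:
        try:
            os.unlink(tmp_path)  # don't leave temp litter on failure
        except FileNotFoundError:
            pass
        raise
```

Readers of the config file either see the complete old version or the complete new one; a kill mid-write only ever loses the hidden temp file.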
6. Catching Exception instead of specific exceptions
You wrap everything in except Exception: pass because you do not want the script to crash. A typo in a variable name raises NameError. A network failure raises ConnectionError. Both are silently swallowed. Your script reports success when it actually did nothing.
Fix: Catch specific exceptions. except requests.RequestException for HTTP errors. except ClientError for AWS errors. Log what you catch. Let unexpected exceptions propagate and crash loudly.
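A sketch of the idea using only the standard library so it runs anywhere; with requests you would catch requests.RequestException instead. fetch_health is a hypothetical helper:

```python
import json
import logging
import urllib.error
import urllib.request

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("healthcheck")

def fetch_health(url: str):
    """Catch only the failures we expect; let bugs crash loudly."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)
    except (urllib.error.URLError, TimeoutError, json.JSONDecodeError) as exc:
        # Expected operational failure: log it with context, degrade gracefully.
        log.error("health check failed for %s: %s", url, exc)
        return None
    # A NameError or KeyError from a bug is NOT caught here -- it
    # propagates and the script fails visibly, as it should.
```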
7. Running fleet operations sequentially
You SSH into 500 servers one at a time to check disk space. Each connection takes 2-3 seconds. Your script takes 25 minutes to run. By the time it finishes, the first results are stale.
Fix: Use concurrent.futures.ThreadPoolExecutor with 20-50 workers. The same operation completes in under a minute. Set max_workers based on what the target infrastructure can handle.
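A sketch of the fan-out pattern; check_host is a placeholder for the real per-host operation (the SSH disk check in the example above):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def check_host(hostname: str) -> tuple:
    """Placeholder: a real version would SSH in and run df."""
    return hostname, True

def check_fleet(hostnames, max_workers: int = 30) -> dict:
    """Run check_host across the fleet on a bounded thread pool."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(check_host, h): h for h in hostnames}
        for future in as_completed(futures):
            host, ok = future.result()  # re-raises per-host exceptions here
            results[host] = ok
    return results
```

Threads are the right tool here because the work is I/O-bound (waiting on network), so the GIL is not a bottleneck; max_workers is the throttle that protects the targets.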
8. Loading entire large files into memory
Your script reads a 5 GB log file with f.readlines() to grep for errors. Python loads the entire file into memory plus overhead. Your 8 GB server runs out of memory. The OOM killer terminates your script and maybe your application too.
Fix: Stream files line by line by iterating over the file object (for line in f) inside a with block. Use generators instead of list comprehensions for large datasets. Process S3 objects one page at a time with paginators.
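A minimal streaming sketch; count_errors is a hypothetical helper:

```python
def count_errors(path: str, needle: str = "ERROR") -> int:
    """Count matching lines without ever holding the file in memory."""
    count = 0
    with open(path, "r", errors="replace") as f:
        for line in f:  # the file object yields one line at a time
            if needle in line:
                count += 1
    return count
```

Memory stays flat whether the log is 5 MB or 5 GB, because only one line is resident at a time; f.readlines() would instead materialize every line as a list up front.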
9. Using os.system instead of subprocess
You use os.system("systemctl restart nginx"). It runs in a subshell. You cannot capture stdout. You cannot capture the exit code reliably. You cannot pass arguments safely. Error handling is impossible.
Fix: Use subprocess.run() with capture_output=True and check=True. It raises CalledProcessError on failure, captures output, and handles arguments safely.
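A sketch of the wrapper; run_checked is a hypothetical name:

```python
import subprocess

def run_checked(args: list) -> subprocess.CompletedProcess:
    """Run a command, capture stdout/stderr, and raise on nonzero exit."""
    return subprocess.run(args, capture_output=True, text=True, check=True)

# Usage: run_checked(["systemctl", "restart", "nginx"])
# On failure this raises CalledProcessError carrying the exit code,
# stdout, and stderr -- everything os.system throws away.
```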
10. No logging in automation scripts
Your script runs via cron. It fails. You have no idea why because it only printed to stdout and cron sends email to root. Nobody reads root's email. The script has been failing for 2 weeks.
Fix: Use Python's logging module. Log to stderr and/or a file. Include timestamps, severity levels, and context. At minimum: logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s').