
Python for Infrastructure Automation

Audience: Linux, cloud, and operations engineers coming from Bash
Target Python: 3.11+
Scope: Core Python, operator-grade scripting, HTTP/APIs, AWS, SSH, templating, concurrency, packaging, testing, and production footguns
What this is: A practical Bash-to-Python guide
What this is not: A complete language reference or a CS textbook


The mission

You already know how to glue systems together with Bash. That still matters. Bash is excellent for short-lived shell glue, command composition, package installs, service restarts, and tiny cron jobs.

It starts to rot when you need real data structures, JSON/YAML parsing, retries, parallel fan-out, good error handling, reusable functions, tests, or API clients. That is the decision line.

Mental model

  • Bash is a text-stream processor.
  • Python is a data-structure processor.

The moment your shell script starts pretending strings are records, arrays are databases, and jq | awk | cut | sed | xargs is “application logic”, you are already writing Python badly.


Table of contents

  1. Why Python and when to switch
  2. Running Python correctly
  3. Core language in operator terms
  4. Data structures that replace Bash pain
  5. Functions, types, and dataclasses
  6. Files, paths, and safe writes
  7. Subprocess and shlex
  8. JSON, YAML, TOML, CSV, and INI
  9. HTTP with requests
  10. Logging and CLI patterns
  11. AWS with boto3
  12. SSH with Paramiko
  13. Jinja2 templates
  14. Concurrency for fleet work
  15. Kubernetes client basics
  16. Project layout, packaging, and tooling
  17. Testing infrastructure code
  18. Security defaults and footguns
  19. Cheat sheet
  20. Drills
  21. Verification notes

1. Why Python and when to switch

BASH TERRITORY                          | PYTHON TERRITORY
----------------------------------------|------------------------------------------
one-liners                              | structured data parsing
simple wrappers around commands         | API clients with auth and retries
service restart / package install       | reusable logic and libraries
small cron jobs                         | tests, validation, dry-run, logging
quick grep/awk/sed                      | JSON/YAML/TOML/CSV processing
environment bootstrap                   | anything with branching + recovery

The 100-line rule

If your Bash script is over about 100 lines, ask one blunt question:

Is this still glue, or is it now logic?

Glue is fine in Bash. Logic belongs in Python.

Three signals you should switch now

  1. You are building data structures with declare -A, positional conventions, or variable name gymnastics.
  2. You are parsing structured formats, especially JSON or YAML.
  3. You need recovery, retries, fallback behavior, validation, or tests.
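Signal 1 in concrete form. The Bash side needs declare -A plus key-name conventions to fake a nested record; in Python it is just a dict of dicts. The hosts and fields below are made up for illustration:

```python
# Bash needs name gymnastics for this:
#   declare -A web01; web01[ip]=10.0.0.5; web01[role]=web
hosts = {
    "web-01": {"ip": "10.0.0.5", "role": "web"},
    "db-01":  {"ip": "10.0.0.9", "role": "db"},
}

# One comprehension replaces a loop over conventionally named variables
web_ips = [h["ip"] for h in hosts.values() if h["role"] == "web"]
print(web_ips)
```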

What Python buys you

  • Real types: integers, booleans, dicts, lists, sets, None
  • Exceptions instead of mystery exit-code soup
  • Standard library depth that replaces half your shell dependencies
  • Good third-party libraries for AWS, HTTP, SSH, Kubernetes, templates, and testing
  • Readable scripts that other humans can extend without ritual sacrifice

2. Running Python correctly

REPL

python3
>>> 2 + 2
4
>>> import json
>>> json.dumps({"ok": True})
'{"ok": true}'

Use the REPL the same way you use a scratch shell.

Scripts and shebangs

#!/usr/bin/env python3
print("hello")

Use chmod +x script.py, then run ./script.py.

python vs python3

Use this rule:

  • Outside a virtual environment: prefer python3
  • Inside an activated virtual environment: python is fine and usually preferred

Why: the python command is intentionally not uniform across Unix-like systems. It may point to Python 3, Python 2 on older systems, or not exist at all. In an active virtual environment, python should point to that environment’s interpreter.

Good shell one-liners

python3 -c 'import sys; print(sys.version)'
python3 -m json.tool < data.json
python3 -m http.server 8000
python3 -c 'import secrets; print(secrets.token_urlsafe(32))'

Virtual environments, early not late

python3 -m venv .venv
source .venv/bin/activate
python -m pip install requests

Do not learn Python by polluting the system interpreter. That is how small experiments become fossilized bad habits.


3. Core language in operator terms

Variables and types

host = "web-01"          # str
port = 443                # int
uptime_days = 17.5        # float
enabled = True            # bool
last_error = None         # nothing / null equivalent

In Bash, everything is a string until a command pretends otherwise. In Python, values carry actual types.

Explicit conversion

port = int("8080")
ratio = float("99.7")
name = str(42)

Python would rather fail loudly than quietly do nonsense. Good.

Truthiness

if not items:         # empty list, dict, set, string -> False
    print("nothing to do")

if value is None:     # check for missing value explicitly
    print("unset")

Use is None, not == None.

Control flow

if cpu > 95:
    state = "critical"
elif cpu > 80:
    state = "warning"
else:
    state = "ok"

Loops

servers = ["web-01", "web-02", "db-01"]

for server in servers:
    print(server)

for i, server in enumerate(servers, start=1):
    print(i, server)

f-strings

host = "web-03"
port = 8443
print(f"connecting to {host}:{port}")
print(f"{'HOST':<20} {'PORT':>5}")
print(f"uptime: {99.734:.1f}%")

This is what printf wished it had become after therapy.


4. Data structures that replace Bash pain

Lists

servers = ["web-01", "web-02", "db-01"]
servers.append("cache-01")
print(servers[0])
print(servers[-1])

high_ports = [p for p in [80, 443, 8080, 9090] if p > 1024]

Dicts

service_state = {
    "sshd": "running",
    "nginx": "running",
    "postgres": "failed",
}

print(service_state["sshd"])
print(service_state.get("cron", "unknown"))

for name, state in service_state.items():
    print(name, state)

Sets

seen = {"web-01", "web-02"}
if "web-01" in seen:
    print("duplicate")

Use sets for membership tests and dedupe. They are brutally useful.
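A minimal dedupe sketch, replacing the usual sort | uniq step (hostnames are illustrative):

```python
hosts = ["web-01", "web-02", "web-01", "db-01", "web-02"]

unique = sorted(set(hosts))   # dedupe, then sort for stable output
print(unique)                 # ['db-01', 'web-01', 'web-02']
```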

Counter

from collections import Counter

counts = Counter()
with open("/var/log/syslog", encoding="utf-8", errors="replace") as f:
    for line in f:
        parts = line.split()
        if len(parts) >= 5:
            program = parts[4].split("[")[0].rstrip(":")
            counts[program] += 1

for program, n in counts.most_common(10):
    print(f"{program:<20} {n:>6}")

This replaces a depressing amount of awk | sort | uniq -c nonsense.

defaultdict

from collections import defaultdict

hosts_by_role = defaultdict(list)
hosts_by_role["web"].append("web-01")
hosts_by_role["web"].append("web-02")
hosts_by_role["db"].append("db-01")

Comprehensions

healthy = [h for h in hosts if h["status"] == "ok"]
ports = {svc["name"]: svc["port"] for svc in services}

Use comprehensions when they stay readable. If it looks like line noise, use a normal loop.


5. Functions, types, and dataclasses

Functions

def classify_load(value: float) -> str:
    if value >= 10:
        return "critical"
    if value >= 5:
        return "warning"
    return "ok"

Functions are where shell scripts stop being a haunted forest.

Type hints

Type hints do not change runtime behavior by themselves. They improve readability and let tools catch dumb mistakes early.

def ports_from_text(lines: list[str]) -> list[int]:
    result: list[int] = []
    for line in lines:
        line = line.strip()
        if line:
            result.append(int(line))
    return result

Use hints on function boundaries first. That gets most of the value with minimal ceremony.

Optional and unions

def find_host(name: str) -> dict[str, str] | None:
    ...

Dataclasses

Use a dict when the shape is loose. Use a dataclass when the shape matters.

from dataclasses import dataclass, field

@dataclass(slots=True)
class Host:
    name: str
    address: str
    port: int = 22
    tags: list[str] = field(default_factory=list)

Why this matters:

  • named fields instead of mystery dict keys
  • sane defaults
  • easy printing and testing
  • fewer typo bugs

Common bug: mutable defaults

Bad:

def add_host(name, tags=[]):
    tags.append(name)
    return tags

Good:

def add_host(name: str, tags: list[str] | None = None) -> list[str]:
    tags = [] if tags is None else tags
    tags.append(name)
    return tags

Dataclasses solve this with field(default_factory=list).


6. Files, paths, and safe writes

pathlib first

from pathlib import Path

path = Path("/etc/myapp/config.yaml")
print(path.name)         # config.yaml
print(path.suffix)       # .yaml
print(path.parent)       # /etc/myapp
print(path.exists())

Prefer pathlib over stringly-typed paths.
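Two habits that pay off immediately: join paths with the / operator instead of string concatenation, and use glob() where you would have shelled out to ls. The paths below are illustrative:

```python
from pathlib import Path

conf = Path("/etc") / "myapp" / "config.yaml"   # joining, no manual "/" bookkeeping

# roughly: ls /etc/*.conf
conf_files = sorted(Path("/etc").glob("*.conf"))

print(conf.name, conf.suffix)
```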

Reading and writing text

from pathlib import Path

config = Path("config.txt")
text = config.read_text(encoding="utf-8")
config.write_text("enabled=true\n", encoding="utf-8")

For large files, stream them instead of slurping everything into RAM.

with open("/var/log/syslog", encoding="utf-8", errors="replace") as f:
    for line in f:
        process(line)

File modes

Mode          Meaning
"r"           read
"w"           write and truncate immediately
"a"           append
"x"           create only if missing
"rb" / "wb"   binary read/write

Safe atomic write

If the file matters, do not write directly to the target path.

import os
import tempfile
from pathlib import Path


def atomic_write_text(path: str | Path, content: str, *, encoding: str = "utf-8") -> None:
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)

    fd, tmp_name = tempfile.mkstemp(prefix=f".{path.name}.", suffix=".tmp", dir=path.parent)
    tmp_path = Path(tmp_name)

    try:
        with os.fdopen(fd, "w", encoding=encoding) as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())
        tmp_path.replace(path)
    except Exception:
        tmp_path.unlink(missing_ok=True)
        raise

Notes:

  • create temp file in the same filesystem as the target
  • flush and fsync() before replacement when durability matters
  • Path.replace() is the explicit “overwrite target” move

7. Subprocess and shlex

Prefer native Python when possible

If Python already has a library for the task, use it.

  • pathlib instead of ls, dirname, basename
  • json instead of jq for JSON already in your process
  • csv instead of shell splitting CSV like a maniac
  • shutil instead of cp, mv, rm in many cases

Safe subprocess pattern

import subprocess

result = subprocess.run(
    ["systemctl", "is-active", "nginx"],
    capture_output=True,
    text=True,
    check=False,
    timeout=10,
)

print(result.returncode)
print(result.stdout.strip())
print(result.stderr.strip())

Never join shell words yourself

import shlex

cmd = ["ssh", host, "sudo", "systemctl", "restart", service]
print("debug:", shlex.join(cmd))

Use shlex.join() for logging. Use list arguments for execution.

shell=True is an escape hatch, not a default

Bad:

subprocess.run(f"grep {pattern} {filename}", shell=True)

Good:

subprocess.run(["grep", pattern, filename], check=False)

Use shell=True only when you genuinely need shell syntax such as pipes, globs, redirects, or brace expansion.
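Even pipes often do not require shell=True: you can wire the processes together yourself. A sketch with illustrative commands (printf and grep assumed to be on PATH):

```python
import subprocess

# roughly: printf ... | grep -i usb, with no shell involved
producer = subprocess.Popen(
    ["printf", "one\ntwo usb\nthree\n"],
    stdout=subprocess.PIPE,
)
consumer = subprocess.run(
    ["grep", "-i", "usb"],
    stdin=producer.stdout,
    capture_output=True,
    text=True,
    check=False,
)
producer.stdout.close()   # let the producer see SIGPIPE if grep exits early
producer.wait()
print(consumer.stdout.strip())
```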


8. JSON, YAML, TOML, CSV, and INI

JSON

import json

payload = json.loads('{"host": "web-01", "port": 443}')
print(payload["host"])

print(json.dumps(payload, indent=2, sort_keys=True))

YAML

import yaml

with open("config.yaml", encoding="utf-8") as f:
    data = yaml.safe_load(f)

Use safe_load(), not load().

YAML type surprises

YAML will happily interpret values in ways that surprise people. Quote ambiguous values if you care about exact strings.

enabled: true     # boolean, as are yes/on under YAML 1.1 loaders like PyYAML
port: 080         # not int 80: the leading zero makes PyYAML keep the string "080"
name: "true"      # quoted, so forced string

TOML

pyproject.toml made TOML unavoidable. Learn the basics.

import tomllib

with open("pyproject.toml", "rb") as f:
    data = tomllib.load(f)

CSV

Do not parse CSV with split(','). That is how quoted commas ruin your afternoon.

import csv

with open("hosts.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row["host"], row["role"])

INI / classic config files

from configparser import ConfigParser

cfg = ConfigParser()
cfg.read("app.ini")
port = cfg.getint("server", "port", fallback=8080)

9. HTTP with requests

The baseline pattern

import requests

resp = requests.get("https://example.com/health", timeout=(3.05, 10))
resp.raise_for_status()
print(resp.json())

Timeouts are mandatory

Without a timeout, your code can hang indefinitely.

Remember:

  • timeout=5 applies to both connect and read timeouts
  • timeout=(3.05, 10) splits connect and read
  • these are not full wall-clock budgets for the whole request
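When you do need an overall budget, enforce it yourself. A minimal sketch using a monotonic deadline; the 30-second budget and 10-second per-call cap are arbitrary:

```python
import time

BUDGET_S = 30.0
deadline = time.monotonic() + BUDGET_S


def remaining() -> float:
    """Seconds left in the whole-operation budget."""
    return max(0.0, deadline - time.monotonic())


# In a retry loop, pass min(remaining(), 10) as the read timeout on each call
# and stop retrying once remaining() reaches zero.
print(f"{remaining():.0f}s left")
```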

Sessions and retries

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session() -> requests.Session:
    retry = Retry(
        total=5,
        connect=3,
        read=3,
        backoff_factor=0.5,
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=frozenset({"GET", "HEAD", "OPTIONS", "PUT", "DELETE"}),
    )
    adapter = HTTPAdapter(max_retries=retry)
    s = requests.Session()
    s.mount("http://", adapter)
    s.mount("https://", adapter)
    s.headers.update({"User-Agent": "infra-tool/1.0"})
    return s

Retry idempotent operations by default. Retrying a POST that creates money, tickets, or infrastructure can be a career event.

Authentication and secrets

import os

api_token = os.environ["API_TOKEN"]
headers = {"Authorization": f"Bearer {api_token}"}

Do not log tokens. Do not hardcode tokens. Do not stick them in git and act surprised later.


10. Logging and CLI patterns

Logging, not print() spam

import logging
import sys


def setup_logging(verbose: bool = False) -> logging.Logger:
    level = logging.DEBUG if verbose else logging.INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        handlers=[logging.StreamHandler(sys.stderr)],
    )
    return logging.getLogger("infra")

Use:

  • stdout for program output other tools may consume
  • stderr for logs, warnings, and diagnostics

JSON logging

import json
import sys
from datetime import UTC, datetime


def log_json(event: str, **fields) -> None:
    entry = {
        "event": event,
        "ts": datetime.now(UTC).isoformat(),
        **fields,
    }
    print(json.dumps(entry, sort_keys=True), file=sys.stderr)

argparse with subcommands

import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="infra-tool")
    sub = parser.add_subparsers(dest="command", required=True)

    check = sub.add_parser("check", help="run health checks")
    check.add_argument("--host", required=True)
    check.add_argument("--verbose", action="store_true")

    restart = sub.add_parser("restart", help="restart a service")
    restart.add_argument("--host", required=True)
    restart.add_argument("--service", required=True)
    restart.add_argument("--dry-run", action="store_true")

    return parser

Config precedence

This pattern is non-negotiable for real tools:

  1. defaults in code
  2. config file
  3. environment variables
  4. CLI arguments

The closer the input is to the current execution, the higher the precedence.
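A minimal merge sketch; the function name, layer dicts, and INFRA_ env prefix are made up for illustration:

```python
import os


def effective_config(defaults: dict, file_cfg: dict, cli_args: dict,
                     env_prefix: str = "INFRA_") -> dict:
    """Merge config layers so later ones win: defaults < file < env < CLI."""
    merged = {**defaults, **file_cfg}
    for key in defaults:                              # env layer
        value = os.environ.get(env_prefix + key.upper())
        if value is not None:
            merged[key] = value
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged
```

Keeping the merge in one function makes precedence testable instead of folklore.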

Exit codes

  • 0 success
  • non-zero failure
  • reserve stable exit codes if other automation depends on them
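The usual shape, with named constants so the codes stay stable on purpose (the constant names here are illustrative):

```python
import sys

EXIT_OK = 0
EXIT_CHECK_FAILED = 1
EXIT_BAD_USAGE = 2


def main() -> int:
    checks_passed = True   # stand-in for real work
    return EXIT_OK if checks_passed else EXIT_CHECK_FAILED


if __name__ == "__main__":
    sys.exit(main())
```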

11. AWS with boto3

Basic client

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

Pagination is not optional

s3 = boto3.client("s3")


def iter_s3_objects(bucket: str, prefix: str = ""):
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj

Common pattern

def running_instances_by_tag(tag_key: str, tag_value: str) -> list[dict]:
    paginator = ec2.get_paginator("describe_instances")
    items: list[dict] = []

    for page in paginator.paginate(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                items.append(
                    {
                        "id": instance["InstanceId"],
                        "type": instance["InstanceType"],
                        "ip": instance.get("PrivateIpAddress"),
                    }
                )
    return items

Credentials: accurate simplified mental model

Boto3 checks several providers in order and stops at the first one that works. The commonly encountered ones are:

  1. explicit credentials passed to boto3.client()
  2. explicit credentials passed to boto3.Session()
  3. environment variables
  4. assume-role and web-identity providers
  5. IAM Identity Center provider
  6. shared credentials and config files under ~/.aws/
  7. instance or task metadata providers

The full chain is longer and evolves. The important rule is unchanged:

never hardcode credentials.

Error handling

try:
    ec2.stop_instances(InstanceIds=[instance_id])
except ClientError as e:
    code = e.response["Error"]["Code"]
    if code == "InvalidInstanceID.NotFound":
        print(f"instance {instance_id} not found")
    else:
        raise

12. SSH with Paramiko

Secure default pattern

import paramiko


def run_remote_command(host: str, user: str, key_path: str, command: str) -> dict:
    client = paramiko.SSHClient()
    client.load_system_host_keys()
    client.set_missing_host_key_policy(paramiko.RejectPolicy())

    try:
        client.connect(
            hostname=host,
            username=user,
            key_filename=key_path,
            timeout=10,
        )
        stdin, stdout, stderr = client.exec_command(command, timeout=30)
        rc = stdout.channel.recv_exit_status()
        return {
            "host": host,
            "rc": rc,
            "stdout": stdout.read().decode(errors="replace").strip(),
            "stderr": stderr.read().decode(errors="replace").strip(),
        }
    finally:
        client.close()

Lab-only shortcut

AutoAddPolicy() is convenient in throwaway labs and risky in production. It accepts unknown host keys automatically. That is trust-on-first-use with less thinking than even OpenSSH usually expects.

When Paramiko is the wrong tool

If you are fanning out to hundreds of hosts and basically reinventing Ansible, stop. You are writing the prequel to a future incident report.


13. Jinja2 templates

Use templates when generating configs or scripts from structured data.

from jinja2 import Environment, FileSystemLoader

env = Environment(
    loader=FileSystemLoader("templates"),
    trim_blocks=True,
    lstrip_blocks=True,
)

tmpl = env.get_template("nginx.conf.j2")
rendered = tmpl.render(server_name="example.com", upstreams=["10.0.0.1:8080"])

Example template:

server {
    listen 80;
    server_name {{ server_name }};

    location / {
        proxy_pass http://backend;
    }
}

upstream backend {
{% for upstream in upstreams %}
    server {{ upstream }};
{% endfor %}
}

14. Concurrency for fleet work

GIL reality

The GIL still matters for CPU-bound threads. It matters much less for typical ops work because most infrastructure automation is I/O-bound: HTTP, SSH, DNS, sockets, disk waits.

Free-threaded Python note

Modern CPython has experimental free-threaded builds starting in Python 3.13, but you should treat them as an advanced option, not your default operational assumption.

ThreadPoolExecutor

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests


def check_host(host: str) -> dict:
    try:
        r = requests.get(f"http://{host}:8080/health", timeout=(2, 5))
        return {"host": host, "ok": r.ok, "status": r.status_code}
    except requests.RequestException as e:
        return {"host": host, "ok": False, "error": str(e)}


hosts = ["web-01", "web-02", "web-03"]
results = []

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(check_host, host) for host in hosts]
    for future in as_completed(futures):
        results.append(future.result())

Concurrency rules for operators

  • cap worker counts
  • set timeouts everywhere
  • keep operations idempotent when possible
  • distinguish retryable failures from fatal ones
  • use jitter/backoff under load
  • do not DDoS your own control plane because threading was easy
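Jitter and backoff in a form you can unit test: delays are computed separately from sleeping, and the sleep function is injectable. The retryable-error type and numbers are illustrative:

```python
import random
import time


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0):
    """Full-jitter exponential backoff: each delay is uniform in (0, min(cap, base * 2**n))."""
    for n in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** n))


def call_with_retries(fn, attempts: int = 5, sleep=time.sleep):
    last_exc = None
    for delay in backoff_delays(attempts):
        try:
            return fn()
        except ConnectionError as exc:   # stand-in for "retryable"
            last_exc = exc
            sleep(delay)
    raise last_exc
```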

When to use what

Workload                                                Tool
many slow network calls                                 ThreadPoolExecutor
huge CPU-bound parsing                                  multiprocessing
very high-concurrency async libraries already in play   asyncio

For most sysadmin and cloud scripts, threads are the correct boring choice.


15. Kubernetes client basics

from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="default",
    label_selector="app=myapp",
    limit=200,
)

for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)

Important scale note

list_pod_for_all_namespaces() is fine for demos and small clusters. On large clusters it is expensive. Prefer:

  • namespace scoping
  • label selectors
  • field selectors
  • chunking with limit and _continue
  • watch when you actually need a stream of updates
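Chunking is easy to get right if you isolate the continue-token loop. A sketch written against the list_namespaced_pod-style signature (namespace=, limit=, _continue=, responses carrying .items and .metadata._continue), so it can be exercised with a fake in tests:

```python
def iter_chunked(list_fn, namespace: str = "default", limit: int = 200):
    """Yield items from a Kubernetes-style list call, one page at a time."""
    token = None
    while True:
        kwargs = {"namespace": namespace, "limit": limit}
        if token:
            kwargs["_continue"] = token
        resp = list_fn(**kwargs)
        yield from resp.items
        token = resp.metadata._continue
        if not token:
            return
```

With the real client this would be iter_chunked(v1.list_namespaced_pod, namespace="prod").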

CrashLoopBackOff detector

def crashing_pods(namespace: str = "default") -> list[str]:
    out: list[str] = []
    resp = v1.list_namespaced_pod(namespace=namespace)
    for pod in resp.items:
        for cs in pod.status.container_statuses or []:
            waiting = cs.state.waiting
            if waiting and waiting.reason == "CrashLoopBackOff":
                out.append(pod.metadata.name)
                break
    return out

16. Project layout, packaging, and tooling

A sane small-tool layout

infra-tool/
├── pyproject.toml
├── README.md
├── src/
│   └── infra_tool/
│       ├── __init__.py
│       ├── __main__.py
│       ├── cli.py
│       ├── logging.py
│       ├── config.py
│       ├── aws.py
│       └── models.py
└── tests/
    ├── test_cli.py
    ├── test_config.py
    └── test_aws.py

pyproject.toml

[project]
name = "infra-tool"
version = "0.1.0"
description = "infrastructure automation CLI"
requires-python = ">=3.11"
dependencies = [
  "requests>=2.32",
  "boto3>=1.35",
  "PyYAML>=6.0",
  "click>=8.1",
]

[project.scripts]
infra-tool = "infra_tool.cli:main"

Virtual environments

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .

requirements.txt vs pyproject.toml

  • pyproject.toml is the modern project metadata standard
  • requirements.txt is still useful for pinned deploy or CI environments
  • do not confuse “project dependencies” with “fully locked deploy state”

pip-tools

python -m pip install pip-tools
pip-compile pyproject.toml -o requirements.txt
pip-sync requirements.txt

Good when you want human-declared dependencies and machine-generated pins.

uv

uv is a strong modern packaging tool. It is fast, useful, and worth knowing.

python -m pip install uv
uv venv .venv
source .venv/bin/activate
uv pip install -e .

Balanced view:

  • good: fast, modern, replaces several common workflows
  • not magic: its pip-compatible interface is intentionally not an exact clone of pip and pip-tools
  • practical rule: use it when it fits your repo and team, not because tool fashion demanded tribute

17. Testing infrastructure code

pytest baseline

# tests/test_config.py
from infra_tool.config import normalize_port


def test_normalize_port_accepts_string():
    assert normalize_port("443") == 443

tmp_path

def test_write_config(tmp_path):
    target = tmp_path / "config.txt"
    target.write_text("enabled=true\n", encoding="utf-8")
    assert target.read_text(encoding="utf-8") == "enabled=true\n"

monkeypatch

def test_reads_env(monkeypatch):
    monkeypatch.setenv("API_TOKEN", "test-token")
    assert get_api_token() == "test-token"

Mocking HTTP

from unittest.mock import Mock, patch


def test_health_check_ok():
    fake = Mock()
    fake.ok = True
    fake.status_code = 200
    with patch("requests.get", return_value=fake):
        result = health_check("https://example.com")
    assert result["ok"] is True

Mocking subprocess

from unittest.mock import patch
import subprocess


def test_systemctl_status():
    fake = subprocess.CompletedProcess(["systemctl"], 0, "active\n", "")
    with patch("subprocess.run", return_value=fake):
        assert is_service_active("nginx") is True

What to test first

  1. parsing and validation
  2. config precedence
  3. retry behavior without waiting in real time
  4. path and file write logic
  5. CLI argument handling
  6. error paths, not just happy paths
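Item 3 in practice: patch out time.sleep so retry tests finish instantly. The retry helper and names here are illustrative:

```python
import time


def fetch_with_retries(fetch, attempts: int = 3, delay: float = 1.0):
    """Call fetch() up to `attempts` times, sleeping `delay` between failures."""
    for n in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            if n == attempts - 1:
                raise
            time.sleep(delay)


def test_retry_sleeps_between_attempts(monkeypatch):
    slept = []
    monkeypatch.setattr(time, "sleep", slept.append)   # record instead of waiting

    calls = {"n": 0}

    def flaky():
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("transient")
        return "ok"

    assert fetch_with_retries(flaky) == "ok"
    assert slept == [1.0, 1.0]
```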

18. Security defaults and footguns

Secure defaults

  • set timeouts on every network call
  • verify TLS certs unless you have a real reason not to
  • reject unknown SSH host keys in production
  • avoid shell=True with untrusted input
  • never hardcode secrets
  • redact secrets from logs
  • use dry-run mode for destructive operations
  • page through APIs with pagination
  • stream large files instead of reading everything at once
  • separate operator-facing logs from machine-readable output

Footguns

1. Hardcoded credentials

Bad:

boto3.client("s3", aws_access_key_id="AKIA...", aws_secret_access_key="...")

Use environment variables, shared config, role-based credentials, or a secret store.

2. No timeout on HTTP

Bad:

requests.get(url)

Good:

requests.get(url, timeout=(3.05, 10))

3. Blind retries on non-idempotent operations

Retried POSTs can create duplicates. Know the semantics.

4. yaml.load() instead of yaml.safe_load()

Use safe_load() unless you genuinely need custom object construction.

5. Writing config files in place

Partial writes plus process crashes produce cursed half-files. Use atomic replacement.

6. Catching broad Exception and hiding context

Bad:

try:
    do_work()
except Exception:
    print("failed")

Good:

try:
    do_work()
except FileNotFoundError:
    ...
except PermissionError:
    ...

7. Unbounded concurrency

Congratulations, you parallelized your outage.

8. Logging secrets

Do not emit bearer tokens, passwords, signed URLs, session cookies, or full cloud API payloads that contain them.

9. CSV with split(',')

No.

10. Building giant dict soup instead of typed boundaries

Loose dicts are fine at the edges. Deep inside the codebase they become a swamp.


19. Cheat sheet

Bash -> Python Rosetta stone

Bash idea                 Python equivalent
VAR=value                 var = value
${var} in strings         f"{var}"
arrays                    list
associative arrays        dict
grep / awk pipelines      loops, comprehensions, Counter, re
exit codes only           exceptions + explicit exit codes
$(cmd)                    subprocess.run(...)
heredoc templates         f-strings or Jinja2
jq                        json module
ad hoc env vars           config precedence

Good defaults

Python version target: 3.11+
Run system interpreter as: python3
Run venv interpreter as: python
Paths: pathlib
HTTP: requests.Session + timeout + retries
Files: atomic replace for important writes
CLI: argparse with subcommands
Tests: pytest
Typing: annotate function boundaries first
Secrets: env vars / roles / secret store

Standard library modules worth memorizing

pathlib        paths and filesystem work
json           JSON encode/decode
csv            CSV parsing/writing
configparser   INI files
tomllib        TOML parsing
subprocess     external commands
shlex          safe shell quoting for display
collections    Counter, defaultdict, deque
datetime       time handling
logging        production logs
argparse       CLI parsing
concurrent.futures   simple threading/process pools

20. Drills

Drill 1 - Parse JSON safely

Write a function that accepts a JSON string containing a list of objects, filters for enabled=true, and returns hostnames.

Drill 2 - Atomic config update

Write a function that updates /tmp/app.conf with a rendered config string using atomic replacement.

Drill 3 - HTTP health fan-out

Given a list of hosts, use ThreadPoolExecutor and requests.Session to collect /health results with timeouts.

Drill 4 - Config precedence

Implement: defaults < file < env < CLI.

Drill 5 - Paginated AWS listing

List every object in an S3 prefix and count total size without loading all results into memory.

Drill 6 - Kubernetes CrashLoop detector

Return names of pods with any container waiting in CrashLoopBackOff, scoped to a namespace.


21. Verification notes

This revision intentionally corrected and updated a few areas that commonly go stale:

  • guidance on python vs python3
  • free-threaded Python status
  • datetime.utcnow() deprecation
  • atomic file writes using mkstemp() and explicit cleanup
  • secure Paramiko host-key handling
  • boto3 credential-provider discussion
  • requests timeout semantics
  • Kubernetes list-scaling guidance
  • TOML in the standard library
  • balanced treatment of uv

Checked against official docs on 2026-03-23

  • PEP 394 - python command guidance
  • Python docs - free-threaded Python
  • Python 3.12+ docs - datetime.utcnow() deprecation
  • Python docs - tempfile.mkstemp() and pathlib.Path.replace()
  • Requests advanced usage docs
  • Boto3 credentials docs
  • Paramiko client docs
  • Kubernetes API concepts docs
  • Ansible interpreter discovery and raw module docs
  • Astral uv docs

Final opinion

Python is not magic. It is just the point where your automation stops pretending that strings are a database, grep is a parser, and exit code 1 is “error handling”.

For infrastructure work, that is enough.