
Python for Infrastructure Automation

Audience: Linux, cloud, and operations engineers coming from Bash
Target Python: 3.11+
Scope: Core Python, operator-grade scripting, HTTP/APIs, AWS, SSH, templating, concurrency, packaging, testing, and production footguns
What this is: A practical Bash-to-Python guide
What this is not: A complete language reference or a CS textbook


The mission

You already know how to glue systems together with Bash. That still matters. Bash is excellent for short-lived shell glue, command composition, package installs, service restarts, and tiny cron jobs.

It starts to rot when you need real data structures, JSON/YAML parsing, retries, parallel fan-out, good error handling, reusable functions, tests, or API clients. That is the decision line.

Mental model

  • Bash is a text-stream processor.
  • Python is a data-structure processor.

The moment your shell script starts pretending strings are records, arrays are databases, and jq | awk | cut | sed | xargs is “application logic”, you are already writing Python badly.


Table of contents

  1. Why Python and when to switch
  2. Running Python correctly
  3. Core language in operator terms
  4. Data structures that replace Bash pain
  5. Functions, types, and dataclasses
  6. Files, paths, and safe writes
  7. Subprocess and shlex
  8. JSON, YAML, TOML, CSV, and INI
  9. HTTP with requests
  10. Logging and CLI patterns
  11. AWS with boto3
  12. SSH with Paramiko
  13. Jinja2 templates
  14. Concurrency for fleet work
  15. Kubernetes client basics
  16. Project layout, packaging, and tooling
  17. Testing infrastructure code
  18. Security defaults and footguns
  19. Cheat sheet
  20. Drills
  21. Verification notes

1. Why Python and when to switch

BASH TERRITORY                          | PYTHON TERRITORY
----------------------------------------|------------------------------------------
one-liners                              | structured data parsing
simple wrappers around commands         | API clients with auth and retries
service restart / package install       | reusable logic and libraries
small cron jobs                         | tests, validation, dry-run, logging
quick grep/awk/sed                      | JSON/YAML/TOML/CSV processing
environment bootstrap                   | anything with branching + recovery

The 100-line rule

If your Bash script is over about 100 lines, ask one blunt question:

Is this still glue, or is it now logic?

Glue is fine in Bash. Logic belongs in Python.

Three signals you should switch now

  1. You are building data structures with declare -A, positional conventions, or variable name gymnastics.
  2. You are parsing structured formats, especially JSON or YAML.
  3. You need recovery, retries, fallback behavior, validation, or tests.
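Signal 1 in concrete form. The Bash side needs declare -A plus key-name conventions to fake a nested record; in Python it is just a dict of dicts. The hosts and fields below are made up for illustration:

```python
# Bash needs name gymnastics for this:
#   declare -A web01; web01[ip]=10.0.0.5; web01[role]=web
hosts = {
    "web-01": {"ip": "10.0.0.5", "role": "web"},
    "db-01":  {"ip": "10.0.0.9", "role": "db"},
}

# One comprehension replaces a loop over conventionally named variables
web_ips = [h["ip"] for h in hosts.values() if h["role"] == "web"]
print(web_ips)
```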

What Python buys you

  • Real types: integers, booleans, dicts, lists, sets, None
  • Exceptions instead of mystery exit-code soup
  • Standard library depth that replaces half your shell dependencies
  • Good third-party libraries for AWS, HTTP, SSH, Kubernetes, templates, and testing
  • Readable scripts that other humans can extend without ritual sacrifice

2. Running Python correctly

REPL

python3
>>> 2 + 2
4
>>> import json
>>> json.dumps({"ok": True})
'{"ok": true}'

Use the REPL the same way you use a scratch shell.

Scripts and shebangs

#!/usr/bin/env python3
print("hello")

Use chmod +x script.py, then run ./script.py.

python vs python3

Use this rule:

  • Outside a virtual environment: prefer python3
  • Inside an activated virtual environment: python is fine and usually preferred

Why: the python command is intentionally not uniform across Unix-like systems. It may point to Python 3, Python 2 on older systems, or not exist at all. In an active virtual environment, python should point to that environment’s interpreter.

Good shell one-liners

python3 -c 'import sys; print(sys.version)'
python3 -m json.tool < data.json
python3 -m http.server 8000
python3 -c 'import secrets; print(secrets.token_urlsafe(32))'

Virtual environments, early not late

python3 -m venv .venv
source .venv/bin/activate
python -m pip install requests

Do not learn Python by polluting the system interpreter. That is how small experiments become fossilized bad habits.


3. Core language in operator terms

Variables and types

host = "web-01"          # str
port = 443                # int
uptime_days = 17.5        # float
enabled = True            # bool
last_error = None         # nothing / null equivalent

In Bash, everything is a string until a command pretends otherwise. In Python, values carry actual types.

Explicit conversion

port = int("8080")
ratio = float("99.7")
name = str(42)

Python would rather fail loudly than quietly do nonsense. Good.

Truthiness

if not items:         # empty list, dict, set, string -> False
    print("nothing to do")

if value is None:     # check for missing value explicitly
    print("unset")

Use is None, not == None.

Control flow

if cpu > 95:
    state = "critical"
elif cpu > 80:
    state = "warning"
else:
    state = "ok"

Loops

servers = ["web-01", "web-02", "db-01"]

for server in servers:
    print(server)

for i, server in enumerate(servers, start=1):
    print(i, server)

f-strings

host = "web-03"
port = 8443
print(f"connecting to {host}:{port}")
print(f"{'HOST':<20} {'PORT':>5}")
print(f"uptime: {99.734:.1f}%")

This is what printf wished it had become after therapy.


4. Data structures that replace Bash pain

Lists

servers = ["web-01", "web-02", "db-01"]
servers.append("cache-01")
print(servers[0])
print(servers[-1])

high_ports = [p for p in [80, 443, 8080, 9090] if p > 1024]

Dicts

service_state = {
    "sshd": "running",
    "nginx": "running",
    "postgres": "failed",
}

print(service_state["sshd"])
print(service_state.get("cron", "unknown"))

for name, state in service_state.items():
    print(name, state)

Sets

seen = {"web-01", "web-02"}
if "web-01" in seen:
    print("duplicate")

Use sets for membership tests and dedupe. They are brutally useful.
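A minimal dedupe sketch, replacing the usual sort | uniq step (hostnames are illustrative):

```python
hosts = ["web-01", "web-02", "web-01", "db-01", "web-02"]

unique = sorted(set(hosts))   # dedupe, then sort for stable output
print(unique)                 # ['db-01', 'web-01', 'web-02']
```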

Counter

from collections import Counter

counts = Counter()
with open("/var/log/syslog", encoding="utf-8", errors="replace") as f:
    for line in f:
        parts = line.split()
        if len(parts) >= 5:
            program = parts[4].split("[")[0].rstrip(":")
            counts[program] += 1

for program, n in counts.most_common(10):
    print(f"{program:<20} {n:>6}")

This replaces a depressing amount of awk | sort | uniq -c nonsense.

defaultdict

from collections import defaultdict

hosts_by_role = defaultdict(list)
hosts_by_role["web"].append("web-01")
hosts_by_role["web"].append("web-02")
hosts_by_role["db"].append("db-01")

Comprehensions

healthy = [h for h in hosts if h["status"] == "ok"]
ports = {svc["name"]: svc["port"] for svc in services}

Use comprehensions when they stay readable. If it looks like line noise, use a normal loop.


5. Functions, types, and dataclasses

Functions

def classify_load(value: float) -> str:
    if value >= 10:
        return "critical"
    if value >= 5:
        return "warning"
    return "ok"

Functions are where shell scripts stop being a haunted forest.

Type hints

Type hints do not change runtime behavior by themselves. They improve readability and let tools catch dumb mistakes early.

def ports_from_text(lines: list[str]) -> list[int]:
    result: list[int] = []
    for line in lines:
        line = line.strip()
        if line:
            result.append(int(line))
    return result

Use hints on function boundaries first. That gets most of the value with minimal ceremony.

Optional and unions

def find_host(name: str) -> dict[str, str] | None:
    ...

Dataclasses

Use a dict when the shape is loose. Use a dataclass when the shape matters.

from dataclasses import dataclass, field

@dataclass(slots=True)
class Host:
    name: str
    address: str
    port: int = 22
    tags: list[str] = field(default_factory=list)

Why this matters:

  • named fields instead of mystery dict keys
  • sane defaults
  • easy printing and testing
  • fewer typo bugs

Common bug: mutable defaults

Bad:

def add_host(name, tags=[]):
    tags.append(name)
    return tags

Good:

def add_host(name: str, tags: list[str] | None = None) -> list[str]:
    tags = [] if tags is None else tags
    tags.append(name)
    return tags

Dataclasses solve this with field(default_factory=list).


6. Files, paths, and safe writes

pathlib first

from pathlib import Path

path = Path("/etc/myapp/config.yaml")
print(path.name)         # config.yaml
print(path.suffix)       # .yaml
print(path.parent)       # /etc/myapp
print(path.exists())

Prefer pathlib over stringly-typed paths.
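Two habits that pay off immediately: join paths with the / operator instead of string concatenation, and use glob() where you would have shelled out to ls. The paths below are illustrative:

```python
from pathlib import Path

conf = Path("/etc") / "myapp" / "config.yaml"   # joining, no manual "/" bookkeeping

# roughly: ls /etc/*.conf
conf_files = sorted(Path("/etc").glob("*.conf"))

print(conf.name, conf.suffix)
```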

Reading and writing text

from pathlib import Path

config = Path("config.txt")
text = config.read_text(encoding="utf-8")
config.write_text("enabled=true\n", encoding="utf-8")

For large files, stream them instead of slurping everything into RAM.

with open("/var/log/syslog", encoding="utf-8", errors="replace") as f:
    for line in f:
        process(line)

File modes

Mode          Meaning
"r"           read
"w"           write and truncate immediately
"a"           append
"x"           create only if missing
"rb" / "wb"   binary read/write

Safe atomic write

If the file matters, do not write directly to the target path.

import os
import tempfile
from pathlib import Path


def atomic_write_text(path: str | Path, content: str, *, encoding: str = "utf-8") -> None:
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)

    fd, tmp_name = tempfile.mkstemp(prefix=f".{path.name}.", suffix=".tmp", dir=path.parent)
    tmp_path = Path(tmp_name)

    try:
        with os.fdopen(fd, "w", encoding=encoding) as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())
        tmp_path.replace(path)
    except Exception:
        tmp_path.unlink(missing_ok=True)
        raise

Notes:

  • create temp file in the same filesystem as the target
  • flush and fsync() before replacement when durability matters
  • Path.replace() is the explicit “overwrite target” move

7. Subprocess and shlex

Prefer native Python when possible

If Python already has a library for the task, use it.

  • pathlib instead of ls, dirname, basename
  • json instead of jq for JSON already in your process
  • csv instead of shell splitting CSV like a maniac
  • shutil instead of cp, mv, rm in many cases

Safe subprocess pattern

import subprocess

result = subprocess.run(
    ["systemctl", "is-active", "nginx"],
    capture_output=True,
    text=True,
    check=False,
    timeout=10,
)

print(result.returncode)
print(result.stdout.strip())
print(result.stderr.strip())

Never join shell words yourself

import shlex

cmd = ["ssh", host, "sudo", "systemctl", "restart", service]
print("debug:", shlex.join(cmd))

Use shlex.join() for logging. Use list arguments for execution.

shell=True is an escape hatch, not a default

Bad:

subprocess.run(f"grep {pattern} {filename}", shell=True)

Good:

subprocess.run(["grep", pattern, filename], check=False)

Use shell=True only when you genuinely need shell syntax such as pipes, globs, redirects, or brace expansion.
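Even pipes often do not require shell=True: you can wire the processes together yourself. A sketch with illustrative commands (printf and grep assumed to be on PATH):

```python
import subprocess

# roughly: printf ... | grep -i usb, with no shell involved
producer = subprocess.Popen(
    ["printf", "one\ntwo usb\nthree\n"],
    stdout=subprocess.PIPE,
)
consumer = subprocess.run(
    ["grep", "-i", "usb"],
    stdin=producer.stdout,
    capture_output=True,
    text=True,
    check=False,
)
producer.stdout.close()   # let the producer see SIGPIPE if grep exits early
producer.wait()
print(consumer.stdout.strip())
```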


8. JSON, YAML, TOML, CSV, and INI

JSON

import json

payload = json.loads('{"host": "web-01", "port": 443}')
print(payload["host"])

print(json.dumps(payload, indent=2, sort_keys=True))

YAML

import yaml

with open("config.yaml", encoding="utf-8") as f:
    data = yaml.safe_load(f)

Use safe_load(), not load().

YAML type surprises

YAML will happily interpret values in ways that surprise people. Quote ambiguous values if you care about exact strings.

enabled: true     # boolean, as are yes/on under YAML 1.1 loaders like PyYAML
port: 080         # not int 80: the leading zero makes PyYAML keep the string "080"
name: "true"      # quoted, so forced string

TOML

pyproject.toml made TOML unavoidable. Learn the basics.

import tomllib

with open("pyproject.toml", "rb") as f:
    data = tomllib.load(f)

CSV

Do not parse CSV with split(','). That is how quoted commas ruin your afternoon.

import csv

with open("hosts.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row["host"], row["role"])

INI / classic config files

from configparser import ConfigParser

cfg = ConfigParser()
cfg.read("app.ini")
port = cfg.getint("server", "port", fallback=8080)

9. HTTP with requests

The baseline pattern

import requests

resp = requests.get("https://example.com/health", timeout=(3.05, 10))
resp.raise_for_status()
print(resp.json())

Timeouts are mandatory

Without a timeout, your code can hang indefinitely.

Remember:

  • timeout=5 applies to both connect and read timeouts
  • timeout=(3.05, 10) splits connect and read
  • these are not full wall-clock budgets for the whole request
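When you do need an overall budget, enforce it yourself. A minimal sketch using a monotonic deadline; the 30-second budget and 10-second per-call cap are arbitrary:

```python
import time

BUDGET_S = 30.0
deadline = time.monotonic() + BUDGET_S


def remaining() -> float:
    """Seconds left in the whole-operation budget."""
    return max(0.0, deadline - time.monotonic())


# In a retry loop, pass min(remaining(), 10) as the read timeout on each call
# and stop retrying once remaining() reaches zero.
print(f"{remaining():.0f}s left")
```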

Sessions and retries

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session() -> requests.Session:
    retry = Retry(
        total=5,
        connect=3,
        read=3,
        backoff_factor=0.5,
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=frozenset({"GET", "HEAD", "OPTIONS", "PUT", "DELETE"}),
    )
    adapter = HTTPAdapter(max_retries=retry)
    s = requests.Session()
    s.mount("http://", adapter)
    s.mount("https://", adapter)
    s.headers.update({"User-Agent": "infra-tool/1.0"})
    return s

Retry idempotent operations by default. Retrying a POST that creates money, tickets, or infrastructure can be a career event.

Authentication and secrets

import os

api_token = os.environ["API_TOKEN"]
headers = {"Authorization": f"Bearer {api_token}"}

Do not log tokens. Do not hardcode tokens. Do not stick them in git and act surprised later.


10. Logging and CLI patterns

Logging, not print() spam

import logging
import sys


def setup_logging(verbose: bool = False) -> logging.Logger:
    level = logging.DEBUG if verbose else logging.INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        handlers=[logging.StreamHandler(sys.stderr)],
    )
    return logging.getLogger("infra")

Use:

  • stdout for program output other tools may consume
  • stderr for logs, warnings, and diagnostics

JSON logging

import json
import sys
from datetime import UTC, datetime


def log_json(event: str, **fields) -> None:
    entry = {
        "event": event,
        "ts": datetime.now(UTC).isoformat(),
        **fields,
    }
    print(json.dumps(entry, sort_keys=True), file=sys.stderr)

argparse with subcommands

import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="infra-tool")
    sub = parser.add_subparsers(dest="command", required=True)

    check = sub.add_parser("check", help="run health checks")
    check.add_argument("--host", required=True)
    check.add_argument("--verbose", action="store_true")

    restart = sub.add_parser("restart", help="restart a service")
    restart.add_argument("--host", required=True)
    restart.add_argument("--service", required=True)
    restart.add_argument("--dry-run", action="store_true")

    return parser

Config precedence

This pattern is non-negotiable for real tools:

  1. defaults in code
  2. config file
  3. environment variables
  4. CLI arguments

The closer the input is to the current execution, the higher the precedence.
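A minimal merge sketch; the function name, layer dicts, and INFRA_ env prefix are made up for illustration:

```python
import os


def effective_config(defaults: dict, file_cfg: dict, cli_args: dict,
                     env_prefix: str = "INFRA_") -> dict:
    """Merge config layers so later ones win: defaults < file < env < CLI."""
    merged = {**defaults, **file_cfg}
    for key in defaults:                              # env layer
        value = os.environ.get(env_prefix + key.upper())
        if value is not None:
            merged[key] = value
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged
```

Keeping the merge in one function makes precedence testable instead of folklore.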

Exit codes

  • 0 success
  • non-zero failure
  • reserve stable exit codes if other automation depends on them
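The usual shape, with named constants so the codes stay stable on purpose (the constant names here are illustrative):

```python
import sys

EXIT_OK = 0
EXIT_CHECK_FAILED = 1
EXIT_BAD_USAGE = 2


def main() -> int:
    checks_passed = True   # stand-in for real work
    return EXIT_OK if checks_passed else EXIT_CHECK_FAILED


if __name__ == "__main__":
    sys.exit(main())
```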

11. AWS with boto3

Basic client

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

Pagination is not optional

s3 = boto3.client("s3")


def iter_s3_objects(bucket: str, prefix: str = ""):
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj

Common pattern

def running_instances_by_tag(tag_key: str, tag_value: str) -> list[dict]:
    paginator = ec2.get_paginator("describe_instances")
    items: list[dict] = []

    for page in paginator.paginate(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                items.append(
                    {
                        "id": instance["InstanceId"],
                        "type": instance["InstanceType"],
                        "ip": instance.get("PrivateIpAddress"),
                    }
                )
    return items

Credentials: accurate simplified mental model

Boto3 checks several providers in order and stops at the first one that works. The commonly encountered ones are:

  1. explicit credentials passed to boto3.client()
  2. explicit credentials passed to boto3.Session()
  3. environment variables
  4. assume-role and web-identity providers
  5. IAM Identity Center provider
  6. shared credentials and config files under ~/.aws/
  7. instance or task metadata providers

The full chain is longer and evolves. The important rule is unchanged:

never hardcode credentials.

Error handling

try:
    ec2.stop_instances(InstanceIds=[instance_id])
except ClientError as e:
    code = e.response["Error"]["Code"]
    if code == "InvalidInstanceID.NotFound":
        print(f"instance {instance_id} not found")
    else:
        raise

12. SSH with Paramiko

Secure default pattern

import paramiko


def run_remote_command(host: str, user: str, key_path: str, command: str) -> dict:
    client = paramiko.SSHClient()
    client.load_system_host_keys()
    client.set_missing_host_key_policy(paramiko.RejectPolicy())

    try:
        client.connect(
            hostname=host,
            username=user,
            key_filename=key_path,
            timeout=10,
        )
        stdin, stdout, stderr = client.exec_command(command, timeout=30)
        rc = stdout.channel.recv_exit_status()
        return {
            "host": host,
            "rc": rc,
            "stdout": stdout.read().decode(errors="replace").strip(),
            "stderr": stderr.read().decode(errors="replace").strip(),
        }
    finally:
        client.close()

Lab-only shortcut

AutoAddPolicy() is convenient in throwaway labs and risky in production. It accepts unknown host keys automatically. That is trust-on-first-use with less thinking than even OpenSSH usually expects.

When Paramiko is the wrong tool

If you are fanning out to hundreds of hosts and basically reinventing Ansible, stop. You are writing the prequel to a future incident report.


13. Jinja2 templates

Use templates when generating configs or scripts from structured data.

from jinja2 import Environment, FileSystemLoader

env = Environment(
    loader=FileSystemLoader("templates"),
    trim_blocks=True,
    lstrip_blocks=True,
)

tmpl = env.get_template("nginx.conf.j2")
rendered = tmpl.render(server_name="example.com", upstreams=["10.0.0.1:8080"])

Example template:

server {
    listen 80;
    server_name {{ server_name }};

    location / {
        proxy_pass http://backend;
    }
}

upstream backend {
{% for upstream in upstreams %}
    server {{ upstream }};
{% endfor %}
}

14. Concurrency for fleet work

GIL reality

The GIL still matters for CPU-bound threads. It matters much less for typical ops work because most infrastructure automation is I/O-bound: HTTP, SSH, DNS, sockets, disk waits.

Free-threaded Python note

Modern CPython has experimental free-threaded builds starting in Python 3.13, but you should treat them as an advanced option, not your default operational assumption.

ThreadPoolExecutor

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests


def check_host(host: str) -> dict:
    try:
        r = requests.get(f"http://{host}:8080/health", timeout=(2, 5))
        return {"host": host, "ok": r.ok, "status": r.status_code}
    except requests.RequestException as e:
        return {"host": host, "ok": False, "error": str(e)}


hosts = ["web-01", "web-02", "web-03"]
results = []

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(check_host, host) for host in hosts]
    for future in as_completed(futures):
        results.append(future.result())

Concurrency rules for operators

  • cap worker counts
  • set timeouts everywhere
  • keep operations idempotent when possible
  • distinguish retryable failures from fatal ones
  • use jitter/backoff under load
  • do not DDoS your own control plane because threading was easy
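Jitter and backoff in a form you can unit test: delays are computed separately from sleeping, and the sleep function is injectable. The retryable-error type and numbers are illustrative:

```python
import random
import time


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0):
    """Full-jitter exponential backoff: each delay is uniform in (0, min(cap, base * 2**n))."""
    for n in range(attempts):
        yield random.uniform(0, min(cap, base * 2 ** n))


def call_with_retries(fn, attempts: int = 5, sleep=time.sleep):
    last_exc = None
    for delay in backoff_delays(attempts):
        try:
            return fn()
        except ConnectionError as exc:   # stand-in for "retryable"
            last_exc = exc
            sleep(delay)
    raise last_exc
```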

When to use what

Workload                                                Tool
many slow network calls                                 ThreadPoolExecutor
huge CPU-bound parsing                                  multiprocessing
very high-concurrency async libraries already in play   asyncio

For most sysadmin and cloud scripts, threads are the correct boring choice.


15. Kubernetes client basics

from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="default",
    label_selector="app=myapp",
    limit=200,
)

for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)

Important scale note

list_pod_for_all_namespaces() is fine for demos and small clusters. On large clusters it is expensive. Prefer:

  • namespace scoping
  • label selectors
  • field selectors
  • chunking with limit and _continue
  • watch when you actually need a stream of updates
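Chunking is easy to get right if you isolate the continue-token loop. A sketch written against the list_namespaced_pod-style signature (namespace=, limit=, _continue=, responses carrying .items and .metadata._continue), so it can be exercised with a fake in tests:

```python
def iter_chunked(list_fn, namespace: str = "default", limit: int = 200):
    """Yield items from a Kubernetes-style list call, one page at a time."""
    token = None
    while True:
        kwargs = {"namespace": namespace, "limit": limit}
        if token:
            kwargs["_continue"] = token
        resp = list_fn(**kwargs)
        yield from resp.items
        token = resp.metadata._continue
        if not token:
            return
```

With the real client this would be iter_chunked(v1.list_namespaced_pod, namespace="prod").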

CrashLoopBackOff detector

def crashing_pods(namespace: str = "default") -> list[str]:
    out: list[str] = []
    resp = v1.list_namespaced_pod(namespace=namespace)
    for pod in resp.items:
        for cs in pod.status.container_statuses or []:
            waiting = cs.state.waiting
            if waiting and waiting.reason == "CrashLoopBackOff":
                out.append(pod.metadata.name)
                break
    return out

16. Project layout, packaging, and tooling

A sane small-tool layout

infra-tool/
├── pyproject.toml
├── README.md
├── src/
│   └── infra_tool/
│       ├── __init__.py
│       ├── __main__.py
│       ├── cli.py
│       ├── logging.py
│       ├── config.py
│       ├── aws.py
│       └── models.py
└── tests/
    ├── test_cli.py
    ├── test_config.py
    └── test_aws.py

pyproject.toml

[project]
name = "infra-tool"
version = "0.1.0"
description = "infrastructure automation CLI"
requires-python = ">=3.11"
dependencies = [
  "requests>=2.32",
  "boto3>=1.35",
  "PyYAML>=6.0",
  "click>=8.1",
]

[project.scripts]
infra-tool = "infra_tool.cli:main"

Virtual environments

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .

requirements.txt vs pyproject.toml

  • pyproject.toml is the modern project metadata standard
  • requirements.txt is still useful for pinned deploy or CI environments
  • do not confuse “project dependencies” with “fully locked deploy state”

pip-tools

python -m pip install pip-tools
pip-compile pyproject.toml -o requirements.txt
pip-sync requirements.txt

Good when you want human-declared dependencies and machine-generated pins.

uv

uv is a strong modern packaging tool. It is fast, useful, and worth knowing.

python -m pip install uv
uv venv .venv
source .venv/bin/activate
uv pip install -e .

Balanced view:

  • good: fast, modern, replaces several common workflows
  • not magic: its pip-compatible interface is intentionally not an exact clone of pip and pip-tools
  • practical rule: use it when it fits your repo and team, not because tool fashion demanded tribute

17. Testing infrastructure code

pytest baseline

# tests/test_config.py
from infra_tool.config import normalize_port


def test_normalize_port_accepts_string():
    assert normalize_port("443") == 443

tmp_path

def test_write_config(tmp_path):
    target = tmp_path / "config.txt"
    target.write_text("enabled=true\n", encoding="utf-8")
    assert target.read_text(encoding="utf-8") == "enabled=true\n"

monkeypatch

def test_reads_env(monkeypatch):
    monkeypatch.setenv("API_TOKEN", "test-token")
    assert get_api_token() == "test-token"

Mocking HTTP

from unittest.mock import Mock, patch


def test_health_check_ok():
    fake = Mock()
    fake.ok = True
    fake.status_code = 200
    with patch("requests.get", return_value=fake):
        result = health_check("https://example.com")
    assert result["ok"] is True

Mocking subprocess

from unittest.mock import patch
import subprocess


def test_systemctl_status():
    fake = subprocess.CompletedProcess(["systemctl"], 0, "active\n", "")
    with patch("subprocess.run", return_value=fake):
        assert is_service_active("nginx") is True

What to test first

  1. parsing and validation
  2. config precedence
  3. retry behavior without waiting in real time
  4. path and file write logic
  5. CLI argument handling
  6. error paths, not just happy paths
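Item 3 in practice: patch out time.sleep so retry tests finish instantly. The retry helper and names here are illustrative:

```python
import time


def fetch_with_retries(fetch, attempts: int = 3, delay: float = 1.0):
    """Call fetch() up to `attempts` times, sleeping `delay` between failures."""
    for n in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            if n == attempts - 1:
                raise
            time.sleep(delay)


def test_retry_sleeps_between_attempts(monkeypatch):
    slept = []
    monkeypatch.setattr(time, "sleep", slept.append)   # record instead of waiting

    calls = {"n": 0}

    def flaky():
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("transient")
        return "ok"

    assert fetch_with_retries(flaky) == "ok"
    assert slept == [1.0, 1.0]
```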

18. Security defaults and footguns

Secure defaults

  • set timeouts on every network call
  • verify TLS certs unless you have a real reason not to
  • reject unknown SSH host keys in production
  • avoid shell=True with untrusted input
  • never hardcode secrets
  • redact secrets from logs
  • use dry-run mode for destructive operations
  • page through APIs with pagination
  • stream large files instead of reading everything at once
  • separate operator-facing logs from machine-readable output

Footguns

1. Hardcoded credentials

Bad:

boto3.client("s3", aws_access_key_id="AKIA...", aws_secret_access_key="...")

Use environment variables, shared config, role-based credentials, or a secret store.

2. No timeout on HTTP

Bad:

requests.get(url)

Good:

requests.get(url, timeout=(3.05, 10))

3. Blind retries on non-idempotent operations

Retried POSTs can create duplicates. Know the semantics.

4. yaml.load() instead of yaml.safe_load()

Use safe_load() unless you genuinely need custom object construction.

5. Writing config files in place

Partial writes plus process crashes produce cursed half-files. Use atomic replacement.

6. Catching broad Exception and hiding context

Bad:

try:
    do_work()
except Exception:
    print("failed")

Good:

try:
    do_work()
except FileNotFoundError:
    ...
except PermissionError:
    ...

7. Unbounded concurrency

Congratulations, you parallelized your outage.

8. Logging secrets

Do not emit bearer tokens, passwords, signed URLs, session cookies, or full cloud API payloads that contain them.

9. CSV with split(',')

No.

10. Building giant dict soup instead of typed boundaries

Loose dicts are fine at the edges. Deep inside the codebase they become a swamp.


19. Cheat sheet

Bash -> Python Rosetta stone

Bash idea                 Python equivalent
VAR=value                 var = value
${var} in strings         f"{var}"
arrays                    list
associative arrays        dict
grep / awk pipelines      loops, comprehensions, Counter, re
exit codes only           exceptions + explicit exit codes
$(cmd)                    subprocess.run(...)
heredoc templates         f-strings or Jinja2
jq                        json module
ad hoc env vars           config precedence

Good defaults

Python version target: 3.11+
Run system interpreter as: python3
Run venv interpreter as: python
Paths: pathlib
HTTP: requests.Session + timeout + retries
Files: atomic replace for important writes
CLI: argparse with subcommands
Tests: pytest
Typing: annotate function boundaries first
Secrets: env vars / roles / secret store

Standard library modules worth memorizing

pathlib        paths and filesystem work
json           JSON encode/decode
csv            CSV parsing/writing
configparser   INI files
tomllib        TOML parsing
subprocess     external commands
shlex          safe shell quoting for display
collections    Counter, defaultdict, deque
datetime       time handling
logging        production logs
argparse       CLI parsing
concurrent.futures   simple threading/process pools

20. Drills

Drill 1 - Parse JSON safely

Write a function that accepts a JSON string containing a list of objects, filters for enabled=true, and returns hostnames.

Drill 2 - Atomic config update

Write a function that updates /tmp/app.conf with a rendered config string using atomic replacement.

Drill 3 - HTTP health fan-out

Given a list of hosts, use ThreadPoolExecutor and requests.Session to collect /health results with timeouts.

Drill 4 - Config precedence

Implement: defaults < file < env < CLI.

Drill 5 - Paginated AWS listing

List every object in an S3 prefix and count total size without loading all results into memory.

Drill 6 - Kubernetes CrashLoop detector

Return names of pods with any container waiting in CrashLoopBackOff, scoped to a namespace.


21. Verification notes

This revision intentionally corrected and updated a few areas that commonly go stale:

  • guidance on python vs python3
  • free-threaded Python status
  • datetime.utcnow() deprecation
  • atomic file writes using mkstemp() and explicit cleanup
  • secure Paramiko host-key handling
  • boto3 credential-provider discussion
  • requests timeout semantics
  • Kubernetes list-scaling guidance
  • TOML in the standard library
  • balanced treatment of uv

Checked against official docs on 2026-03-23

  • PEP 394 - python command guidance
  • Python docs - free-threaded Python
  • Python 3.12+ docs - datetime.utcnow() deprecation
  • Python docs - tempfile.mkstemp() and pathlib.Path.replace()
  • Requests advanced usage docs
  • Boto3 credentials docs
  • Paramiko client docs
  • Kubernetes API concepts docs
  • Ansible interpreter discovery and raw module docs
  • Astral uv docs

Final opinion

Python is not magic. It is just the point where your automation stops pretending that strings are a database, grep is a parser, and exit code 1 is “error handling”.

For infrastructure work, that is enough.