Python for Infrastructure Automation¶
Audience: Linux, cloud, and operations engineers coming from Bash
Target Python: 3.11+
Scope: Core Python, operator-grade scripting, HTTP/APIs, AWS, SSH, templating, concurrency, packaging, testing, and production footguns
What this is: A practical Bash-to-Python guide
What this is not: A complete language reference or a CS textbook
The mission¶
You already know how to glue systems together with Bash. That still matters. Bash is excellent for short-lived shell glue, command composition, package installs, service restarts, and tiny cron jobs.
It starts to rot when you need real data structures, JSON/YAML parsing, retries, parallel fan-out, good error handling, reusable functions, tests, or API clients. That is the decision line.
Mental model
- Bash is a text-stream processor.
- Python is a data-structure processor.
The moment your shell script starts pretending strings are records, arrays are databases, and
jq | awk | cut | sed | xargs is “application logic”, you are already writing Python badly.
Table of contents¶
- Why Python and when to switch
- Running Python correctly
- Core language in operator terms
- Data structures that replace Bash pain
- Functions, types, and dataclasses
- Files, paths, and safe writes
- Subprocess and shlex
- JSON, YAML, TOML, CSV, and INI
- HTTP with requests
- Logging and CLI patterns
- AWS with boto3
- SSH with Paramiko
- Jinja2 templates
- Concurrency for fleet work
- Kubernetes client basics
- Project layout, packaging, and tooling
- Testing infrastructure code
- Security defaults and footguns
- Cheat sheet
- Drills
- Verification notes
1. Why Python and when to switch¶
BASH TERRITORY | PYTHON TERRITORY
----------------------------------------|------------------------------------------
one-liners | structured data parsing
simple wrappers around commands | API clients with auth and retries
service restart / package install | reusable logic and libraries
small cron jobs | tests, validation, dry-run, logging
quick grep/awk/sed | JSON/YAML/TOML/CSV processing
environment bootstrap | anything with branching + recovery
The 100-line rule¶
If your Bash script is over about 100 lines, ask one blunt question:
Is this still glue, or is it now logic?
Glue is fine in Bash. Logic belongs in Python.
Three signals you should switch now¶
- You are building data structures with declare -A, positional conventions, or variable name gymnastics.
- You are parsing structured formats, especially JSON or YAML.
- You need recovery, retries, fallback behavior, validation, or tests.
What Python buys you¶
- Real types: integers, booleans, dicts, lists, sets, None
- Exceptions instead of mystery exit-code soup
- Standard library depth that replaces half your shell dependencies
- Good third-party libraries for AWS, HTTP, SSH, Kubernetes, templates, and testing
- Readable scripts that other humans can extend without ritual sacrifice
2. Running Python correctly¶
REPL¶
Use the REPL the same way you use a scratch shell.
Scripts and shebangs¶
Use chmod +x script.py, then run ./script.py.
python vs python3¶
Use this rule:
- Outside a virtual environment: prefer python3
- Inside an activated virtual environment: python is fine and usually preferred
Why: the python command is intentionally not uniform across Unix-like systems. It may point to Python 3, Python 2 on older systems, or not exist at all. In an active virtual environment, python should point to that environment’s interpreter.
Good shell one-liners¶
python3 -c 'import sys; print(sys.version)'
python3 -m json.tool < data.json
python3 -m http.server 8000
python3 -c 'import secrets; print(secrets.token_urlsafe(32))'
Virtual environments, early not late¶
Do not learn Python by polluting the system interpreter. That is how small experiments become fossilized bad habits.
3. Core language in operator terms¶
Variables and types¶
host = "web-01" # str
port = 443 # int
uptime_days = 17.5 # float
enabled = True # bool
last_error = None # nothing / null equivalent
In Bash, everything is a string until a command pretends otherwise. In Python, values carry actual types.
Explicit conversion¶
Python would rather fail loudly than quietly do nonsense. Good.
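A small sketch of that contrast, using a port value read from text:

```python
# Explicit conversion: Python raises instead of guessing
port = int("443")   # str -> int
print(port + 1)     # 444, real arithmetic, not string concatenation

# Bash would carry "8o80" around as a string until something broke downstream.
# Python fails at the conversion point, with a real error:
try:
    int("8o80")
except ValueError as e:
    print(f"refused: {e}")
```

The failure happens where the bad data enters, not three pipelines later.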
Truthiness¶
if not items: # empty list, dict, set, string -> False
print("nothing to do")
if value is None: # check for missing value explicitly
print("unset")
Use is None, not == None.
Control flow¶
Loops¶
servers = ["web-01", "web-02", "db-01"]
for server in servers:
print(server)
for i, server in enumerate(servers, start=1):
print(i, server)
f-strings¶
host = "web-03"
port = 8443
print(f"connecting to {host}:{port}")
print(f"{'HOST':<20} {'PORT':>5}")
print(f"uptime: {99.734:.1f}%")
This is what printf wished it had become after therapy.
4. Data structures that replace Bash pain¶
Lists¶
servers = ["web-01", "web-02", "db-01"]
servers.append("cache-01")
print(servers[0])
print(servers[-1])
high_ports = [p for p in [80, 443, 8080, 9090] if p > 1024]
Dicts¶
service_state = {
"sshd": "running",
"nginx": "running",
"postgres": "failed",
}
print(service_state["sshd"])
print(service_state.get("cron", "unknown"))
for name, state in service_state.items():
print(name, state)
Sets¶
Use sets for membership tests and dedupe. They are brutally useful.
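A sketch of both uses, with made-up host lists:

```python
# Dedupe: a set silently drops repeats
seen = set()
for host in ["web-01", "db-01", "web-01", "cache-01"]:
    seen.add(host)
print(sorted(seen))  # ['cache-01', 'db-01', 'web-01']

# Membership and set algebra for fleet comparisons
expected = {"web-01", "web-02", "db-01"}
reporting = {"web-01", "db-01"}
print(expected - reporting)   # hosts that went quiet: {'web-02'}
print("web-01" in reporting)  # O(1) membership test: True
```

The difference operation alone replaces a `comm`/`sort` dance you have probably written more than once.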
Counter¶
from collections import Counter
counts = Counter()
with open("/var/log/syslog", encoding="utf-8", errors="replace") as f:
for line in f:
parts = line.split()
if len(parts) >= 5:
program = parts[4].split("[")[0].rstrip(":")
counts[program] += 1
for program, n in counts.most_common(10):
print(f"{program:<20} {n:>6}")
This replaces a depressing amount of awk | sort | uniq -c nonsense.
defaultdict¶
from collections import defaultdict
hosts_by_role = defaultdict(list)
hosts_by_role["web"].append("web-01")
hosts_by_role["web"].append("web-02")
hosts_by_role["db"].append("db-01")
Comprehensions¶
healthy = [h for h in hosts if h["status"] == "ok"]
ports = {svc["name"]: svc["port"] for svc in services}
Use comprehensions when they stay readable. If it looks like line noise, use a normal loop.
5. Functions, types, and dataclasses¶
Functions¶
def classify_load(value: float) -> str:
if value >= 10:
return "critical"
if value >= 5:
return "warning"
return "ok"
Functions are where shell scripts stop being a haunted forest.
Type hints¶
Type hints do not change runtime behavior by themselves. They improve readability and let tools catch dumb mistakes early.
def ports_from_text(lines: list[str]) -> list[int]:
result: list[int] = []
for line in lines:
line = line.strip()
if line:
result.append(int(line))
return result
Use hints on function boundaries first. That gets most of the value with minimal ceremony.
Optional and unions¶
Dataclasses¶
Use a dict when the shape is loose. Use a dataclass when the shape matters.
from dataclasses import dataclass, field
@dataclass(slots=True)
class Host:
name: str
address: str
port: int = 22
tags: list[str] = field(default_factory=list)
Why this matters:
- named fields instead of mystery dict keys
- sane defaults
- easy printing and testing
- fewer typo bugs
Common bug: mutable defaults¶
Bad: def add_host(name, tags=[]), because the default list is created once at definition time and shared across every call.
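A sketch of why the shared default bites:

```python
def add_host(name, tags=[]):  # BUG: one list object, shared by every call
    tags.append(name)
    return tags

first = add_host("web-01")
second = add_host("web-02")   # same list object again
print(second)  # ['web-01', 'web-02'] -- surprise
print(first is second)  # True: both names point at the one default list
```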
Good:
def add_host(name: str, tags: list[str] | None = None) -> list[str]:
tags = [] if tags is None else tags
tags.append(name)
return tags
Dataclasses solve this with field(default_factory=list).
6. Files, paths, and safe writes¶
pathlib first¶
from pathlib import Path
path = Path("/etc/myapp/config.yaml")
print(path.name) # config.yaml
print(path.suffix) # .yaml
print(path.parent) # /etc/myapp
print(path.exists())
Prefer pathlib over stringly-typed paths.
Reading and writing text¶
from pathlib import Path
config = Path("config.txt")
text = config.read_text(encoding="utf-8")
config.write_text("enabled=true\n", encoding="utf-8")
For large files, stream them instead of slurping everything into RAM.
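A sketch of the streaming pattern, counting matching lines without ever loading the file (the demo log path is illustrative):

```python
import tempfile
from pathlib import Path

log = Path(tempfile.gettempdir()) / "demo.log"
log.write_text("ok\nERROR disk full\nok\nERROR timeout\n", encoding="utf-8")

errors = 0
with log.open(encoding="utf-8", errors="replace") as f:
    for line in f:  # one line in memory at a time
        if line.startswith("ERROR"):
            errors += 1
print(errors)  # 2
```

The same loop works identically on a 20 GB log; `read_text()` would not.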
File modes¶
| Mode | Meaning |
|---|---|
| "r" | read |
| "w" | write and truncate immediately |
| "a" | append |
| "x" | create only if missing |
| "rb" / "wb" | binary read/write |
Safe atomic write¶
If the file matters, do not write directly to the target path.
import os
import tempfile
from pathlib import Path
def atomic_write_text(path: str | Path, content: str, *, encoding: str = "utf-8") -> None:
path = Path(path)
path.parent.mkdir(parents=True, exist_ok=True)
fd, tmp_name = tempfile.mkstemp(prefix=f".{path.name}.", suffix=".tmp", dir=path.parent)
tmp_path = Path(tmp_name)
try:
with os.fdopen(fd, "w", encoding=encoding) as f:
f.write(content)
f.flush()
os.fsync(f.fileno())
tmp_path.replace(path)
except Exception:
tmp_path.unlink(missing_ok=True)
raise
Notes:
- create temp file in the same filesystem as the target
- flush and fsync() before replacement when durability matters
- Path.replace() is the explicit “overwrite target” move
7. Subprocess and shlex¶
Prefer native Python when possible¶
If Python already has a library for the task, use it.
- pathlib instead of ls, dirname, basename
- json instead of jq for JSON already in your process
- csv instead of shell splitting CSV like a maniac
- shutil instead of cp, mv, rm in many cases
Safe subprocess pattern¶
import subprocess
result = subprocess.run(
["systemctl", "is-active", "nginx"],
capture_output=True,
text=True,
check=False,
timeout=10,
)
print(result.returncode)
print(result.stdout.strip())
print(result.stderr.strip())
Never join shell words yourself¶
import shlex
cmd = ["ssh", host, "sudo", "systemctl", "restart", service]
print("debug:", shlex.join(cmd))
Use shlex.join() for logging. Use list arguments for execution.
shell=True is an escape hatch, not a default¶
Bad: subprocess.run(f"systemctl restart {service}", shell=True), which hands interpolated input to a shell.
Good: subprocess.run(["systemctl", "restart", service], ...), which passes each argument verbatim.
Use shell=True only when you genuinely need shell syntax such as pipes, globs, redirects, or brace expansion.
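A sketch of why list arguments are safe, using echo and tr as harmless stand-ins for real commands:

```python
import shlex
import subprocess

service = "nginx; rm -rf /"  # hostile input (illustrative)

# With a list, the whole string is passed as ONE argument.
# No shell ever parses it, so the injection is inert.
result = subprocess.run(
    ["echo", service], capture_output=True, text=True, check=True
)
print(result.stdout.strip())  # nginx; rm -rf /

# If you genuinely need shell syntax (here: a pipe), quote untrusted pieces:
safe_cmd = f"echo {shlex.quote(service)} | tr a-z A-Z"
out = subprocess.run(safe_cmd, shell=True, capture_output=True, text=True)
print(out.stdout.strip())  # NGINX; RM -RF /
```

`shlex.quote()` is the narrow bridge between "I need a pipe" and "I just executed attacker input".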
8. JSON, YAML, TOML, CSV, and INI¶
JSON¶
import json
payload = json.loads('{"host": "web-01", "port": 443}')
print(payload["host"])
print(json.dumps(payload, indent=2, sort_keys=True))
YAML¶
Use safe_load(), not load().
YAML type surprises¶
YAML will happily interpret values in ways that surprise people. Quote ambiguous values if you care about exact strings.
enabled: true # boolean
port: 080 # maybe not what you expected in some contexts
name: "true" # forced string
TOML¶
pyproject.toml made TOML unavoidable. Learn the basics.
CSV¶
Do not parse CSV with split(','). That is how quoted commas ruin your afternoon.
import csv
with open("hosts.csv", newline="", encoding="utf-8") as f:
reader = csv.DictReader(f)
for row in reader:
print(row["host"], row["role"])
INI / classic config files¶
from configparser import ConfigParser
cfg = ConfigParser()
cfg.read("app.ini")
port = cfg.getint("server", "port", fallback=8080)
9. HTTP with requests¶
The baseline pattern¶
import requests
resp = requests.get("https://example.com/health", timeout=(3.05, 10))
resp.raise_for_status()
print(resp.json())
Timeouts are mandatory¶
Without a timeout, your code can hang indefinitely.
Remember:
- timeout=5 applies to both the connect and read timeouts
- timeout=(3.05, 10) splits connect and read
- these are not full wall-clock budgets for the whole request
Sessions and retries¶
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def build_session() -> requests.Session:
retry = Retry(
total=5,
connect=3,
read=3,
backoff_factor=0.5,
status_forcelist=(429, 500, 502, 503, 504),
allowed_methods=frozenset({"GET", "HEAD", "OPTIONS", "PUT", "DELETE"}),
)
adapter = HTTPAdapter(max_retries=retry)
s = requests.Session()
s.mount("http://", adapter)
s.mount("https://", adapter)
s.headers.update({"User-Agent": "infra-tool/1.0"})
return s
Retry idempotent operations by default. Retrying a POST that creates money, tickets, or infrastructure can be a career event.
Authentication and secrets¶
Do not log tokens. Do not hardcode tokens. Do not stick them in git and act surprised later.
10. Logging and CLI patterns¶
Logging, not print() spam¶
import logging
import sys
def setup_logging(verbose: bool = False) -> logging.Logger:
level = logging.DEBUG if verbose else logging.INFO
logging.basicConfig(
level=level,
format="%(asctime)s %(levelname)s %(name)s %(message)s",
handlers=[logging.StreamHandler(sys.stderr)],
)
return logging.getLogger("infra")
Use:
- stdout for program output other tools may consume
- stderr for logs, warnings, and diagnostics
JSON logging¶
import json
import sys
from datetime import UTC, datetime
def log_json(event: str, **fields) -> None:
entry = {
"event": event,
"ts": datetime.now(UTC).isoformat(),
**fields,
}
print(json.dumps(entry, sort_keys=True), file=sys.stderr)
argparse with subcommands¶
import argparse
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(prog="infra-tool")
sub = parser.add_subparsers(dest="command", required=True)
check = sub.add_parser("check", help="run health checks")
check.add_argument("--host", required=True)
check.add_argument("--verbose", action="store_true")
restart = sub.add_parser("restart", help="restart a service")
restart.add_argument("--host", required=True)
restart.add_argument("--service", required=True)
restart.add_argument("--dry-run", action="store_true")
return parser
Config precedence¶
This pattern is non-negotiable for real tools:
- defaults in code
- config file
- environment variables
- CLI arguments
The closer the input is to the current execution, the higher the precedence.
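One way to sketch the merge, assuming each layer is already a dict and None means “this layer did not set it”:

```python
def merge_config(defaults: dict, *layers: dict) -> dict:
    # Later layers win; None values mean "not provided at this layer"
    merged = dict(defaults)
    for layer in layers:
        for key, value in layer.items():
            if value is not None:
                merged[key] = value
    return merged

defaults = {"region": "us-east-1", "timeout": 10, "verbose": False}
file_cfg = {"region": "eu-west-1"}
env_cfg = {"timeout": 30}
cli_cfg = {"verbose": True, "timeout": None}  # --timeout was not passed

print(merge_config(defaults, file_cfg, env_cfg, cli_cfg))
# {'region': 'eu-west-1', 'timeout': 30, 'verbose': True}
```

The None check matters: argparse defaults everything to None unless told otherwise, and you do not want an unset CLI flag clobbering a value from the environment.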
Exit codes¶
- 0 success
- non-zero failure
- reserve stable exit codes if other automation depends on them
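A sketch of the main() pattern that keeps exit codes explicit; the code names and the health check are illustrative:

```python
import sys

EXIT_OK = 0
EXIT_CHECK_FAILED = 1
EXIT_BAD_USAGE = 2  # hypothetical stable codes for downstream automation

def main(argv: list[str]) -> int:
    if not argv:
        print("usage: infra-tool <host>", file=sys.stderr)
        return EXIT_BAD_USAGE
    host = argv[0]
    healthy = host.startswith("web-")  # stand-in for a real health check
    return EXIT_OK if healthy else EXIT_CHECK_FAILED

# In a real script the entry point is: sys.exit(main(sys.argv[1:]))
print(main(["web-01"]))  # 0
print(main(["db-01"]))   # 1
```

Returning an int from main() and calling sys.exit() once, at the edge, also makes the whole thing trivially testable.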
11. AWS with boto3¶
Basic client¶
import boto3
from botocore.exceptions import ClientError
ec2 = boto3.client("ec2", region_name="us-east-1")
Pagination is not optional¶
s3 = boto3.client("s3")
def iter_s3_objects(bucket: str, prefix: str = ""):
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
for obj in page.get("Contents", []):
yield obj
Common pattern¶
def running_instances_by_tag(tag_key: str, tag_value: str) -> list[dict]:
paginator = ec2.get_paginator("describe_instances")
items: list[dict] = []
for page in paginator.paginate(
Filters=[
{"Name": f"tag:{tag_key}", "Values": [tag_value]},
{"Name": "instance-state-name", "Values": ["running"]},
]
):
for reservation in page["Reservations"]:
for instance in reservation["Instances"]:
items.append(
{
"id": instance["InstanceId"],
"type": instance["InstanceType"],
"ip": instance.get("PrivateIpAddress"),
}
)
return items
Credentials: accurate simplified mental model¶
Boto3 checks several providers in order and stops at the first one that works. The commonly encountered ones are:
- explicit credentials passed to boto3.client()
- explicit credentials passed to boto3.Session()
- environment variables
- assume-role and web-identity providers
- IAM Identity Center provider
- shared credentials and config files under ~/.aws/
- instance or task metadata providers
The full chain is longer and evolves. The important rule is unchanged:
never hardcode credentials.
Error handling¶
try:
ec2.stop_instances(InstanceIds=[instance_id])
except ClientError as e:
code = e.response["Error"]["Code"]
if code == "InvalidInstanceID.NotFound":
print(f"instance {instance_id} not found")
else:
raise
12. SSH with Paramiko¶
Secure default pattern¶
import paramiko
def run_remote_command(host: str, user: str, key_path: str, command: str) -> dict:
client = paramiko.SSHClient()
client.load_system_host_keys()
client.set_missing_host_key_policy(paramiko.RejectPolicy())
try:
client.connect(
hostname=host,
username=user,
key_filename=key_path,
timeout=10,
)
stdin, stdout, stderr = client.exec_command(command, timeout=30)
rc = stdout.channel.recv_exit_status()
return {
"host": host,
"rc": rc,
"stdout": stdout.read().decode(errors="replace").strip(),
"stderr": stderr.read().decode(errors="replace").strip(),
}
finally:
client.close()
Lab-only shortcut¶
AutoAddPolicy() is convenient in throwaway labs and risky in production. It accepts unknown host keys automatically. That is trust-on-first-use with less thinking than even OpenSSH usually expects.
When Paramiko is the wrong tool¶
If you are fanning out to hundreds of hosts and basically reinventing Ansible, stop. You are writing the prequel to a future incident report.
13. Jinja2 templates¶
Use templates when generating configs or scripts from structured data.
from jinja2 import Environment, FileSystemLoader
env = Environment(
loader=FileSystemLoader("templates"),
trim_blocks=True,
lstrip_blocks=True,
)
tmpl = env.get_template("nginx.conf.j2")
rendered = tmpl.render(server_name="example.com", upstreams=["10.0.0.1:8080"])
Example template:
server {
listen 80;
server_name {{ server_name }};
location / {
proxy_pass http://backend;
}
}
upstream backend {
{% for upstream in upstreams %}
server {{ upstream }};
{% endfor %}
}
14. Concurrency for fleet work¶
GIL reality¶
The GIL still matters for CPU-bound threads. It matters much less for typical ops work because most infrastructure automation is I/O-bound: HTTP, SSH, DNS, sockets, disk waits.
Free-threaded Python note¶
Modern CPython has experimental free-threaded builds starting in Python 3.13, but you should treat them as an advanced option, not your default operational assumption.
ThreadPoolExecutor¶
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests
def check_host(host: str) -> dict:
try:
r = requests.get(f"http://{host}:8080/health", timeout=(2, 5))
return {"host": host, "ok": r.ok, "status": r.status_code}
except requests.RequestException as e:
return {"host": host, "ok": False, "error": str(e)}
hosts = ["web-01", "web-02", "web-03"]
results = []
with ThreadPoolExecutor(max_workers=10) as pool:
futures = [pool.submit(check_host, host) for host in hosts]
for future in as_completed(futures):
results.append(future.result())
Concurrency rules for operators¶
- cap worker counts
- set timeouts everywhere
- keep operations idempotent when possible
- distinguish retryable failures from fatal ones
- use jitter/backoff under load
- do not DDoS your own control plane because threading was easy
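The jitter/backoff rule from the list above can be sketched as a small helper; the names are illustrative, not a library API:

```python
import random
import time

def call_with_backoff(fn, *, attempts: int = 5, base: float = 0.5, cap: float = 8.0):
    # Retry fn() on any exception, with exponential backoff plus full jitter
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the real error
            delay = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter

calls = {"n": 0}
def flaky():
    # Stand-in for a network call that fails twice, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(call_with_backoff(flaky, base=0.01))  # ok
```

In real code, catch only the exception types you know are retryable, not bare `Exception`; this sketch keeps it broad for brevity.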
When to use what¶
| Workload | Tool |
|---|---|
| many slow network calls | ThreadPoolExecutor |
| huge CPU-bound parsing | multiprocessing |
| very high-concurrency async libraries already in play | asyncio |
For most sysadmin and cloud scripts, threads are the correct boring choice.
15. Kubernetes client basics¶
from kubernetes import client, config
config.load_kube_config() # or config.load_incluster_config()
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod(
namespace="default",
label_selector="app=myapp",
limit=200,
)
for pod in pods.items:
print(pod.metadata.name, pod.status.phase)
Important scale note¶
list_pod_for_all_namespaces() is fine for demos and small clusters. On large clusters it is expensive. Prefer:
- namespace scoping
- label selectors
- field selectors
- chunking with limit and _continue
- watch when you actually need a stream of updates
CrashLoopBackOff detector¶
def crashing_pods(namespace: str = "default") -> list[str]:
out: list[str] = []
resp = v1.list_namespaced_pod(namespace=namespace)
for pod in resp.items:
for cs in pod.status.container_statuses or []:
waiting = cs.state.waiting
if waiting and waiting.reason == "CrashLoopBackOff":
out.append(pod.metadata.name)
break
return out
16. Project layout, packaging, and tooling¶
A sane small-tool layout¶
infra-tool/
├── pyproject.toml
├── README.md
├── src/
│ └── infra_tool/
│ ├── __init__.py
│ ├── __main__.py
│ ├── cli.py
│ ├── logging.py
│ ├── config.py
│ ├── aws.py
│ └── models.py
└── tests/
├── test_cli.py
├── test_config.py
└── test_aws.py
pyproject.toml¶
[project]
name = "infra-tool"
version = "0.1.0"
description = "infrastructure automation CLI"
requires-python = ">=3.11"
dependencies = [
"requests>=2.32",
"boto3>=1.35",
"PyYAML>=6.0",
"click>=8.1",
]
[project.scripts]
infra-tool = "infra_tool.cli:main"
Virtual environments¶
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .
requirements.txt vs pyproject.toml¶
- pyproject.toml is the modern project metadata standard
- requirements.txt is still useful for pinned deploy or CI environments
- do not confuse “project dependencies” with “fully locked deploy state”
pip-tools¶
python -m pip install pip-tools
pip-compile pyproject.toml -o requirements.txt
pip-sync requirements.txt
Good when you want human-declared dependencies and machine-generated pins.
uv¶
uv is a strong modern packaging tool. It is fast, useful, and worth knowing.
Balanced view:
- good: fast, modern, replaces several common workflows
- not magic: its pip-compatible interface is intentionally not an exact clone of pip and pip-tools
- practical rule: use it when it fits your repo and team, not because tool fashion demanded tribute
17. Testing infrastructure code¶
pytest baseline¶
# tests/test_config.py
from infra_tool.config import normalize_port
def test_normalize_port_accepts_string():
assert normalize_port("443") == 443
tmp_path¶
def test_write_config(tmp_path):
target = tmp_path / "config.txt"
target.write_text("enabled=true\n", encoding="utf-8")
assert target.read_text(encoding="utf-8") == "enabled=true\n"
monkeypatch¶
def test_reads_env(monkeypatch):
monkeypatch.setenv("API_TOKEN", "test-token")
assert get_api_token() == "test-token"
Mocking HTTP¶
from unittest.mock import Mock, patch
def test_health_check_ok():
fake = Mock()
fake.ok = True
fake.status_code = 200
with patch("requests.get", return_value=fake):
result = health_check("https://example.com")
assert result["ok"] is True
Mocking subprocess¶
from unittest.mock import patch
import subprocess
def test_systemctl_status():
fake = subprocess.CompletedProcess(["systemctl"], 0, "active\n", "")
with patch("subprocess.run", return_value=fake):
assert is_service_active("nginx") is True
What to test first¶
- parsing and validation
- config precedence
- retry behavior without waiting in real time
- path and file write logic
- CLI argument handling
- error paths, not just happy paths
18. Security defaults and footguns¶
Secure defaults¶
- set timeouts on every network call
- verify TLS certs unless you have a real reason not to
- reject unknown SSH host keys in production
- avoid
shell=Truewith untrusted input - never hardcode secrets
- redact secrets from logs
- use dry-run mode for destructive operations
- page through APIs with pagination
- stream large files instead of reading everything at once
- separate operator-facing logs from machine-readable output
Footguns¶
1. Hardcoded credentials¶
Bad: a literal token or access key assigned directly in source code.
Use environment variables, shared config, role-based credentials, or a secret store.
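A minimal fail-fast sketch of the env-var approach; the variable name is illustrative:

```python
import os

def get_api_token() -> str:
    # Fail loudly at startup, not deep inside a request handler
    token = os.environ.get("API_TOKEN")
    if not token:
        raise RuntimeError("API_TOKEN is not set")
    return token

os.environ["API_TOKEN"] = "example-only"  # stands in for the real deploy environment
print(get_api_token())  # example-only
```

Checking once at startup gives one clear error instead of a mystery 401 after twenty minutes of work.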
2. No timeout on HTTP¶
Bad: requests.get(url) with no timeout; the call can hang forever.
Good: requests.get(url, timeout=(3.05, 10)).
3. Blind retries on non-idempotent operations¶
Retried POSTs can create duplicates. Know the semantics.
4. yaml.load() instead of yaml.safe_load()¶
Use safe_load() unless you genuinely need custom object construction.
5. Writing config files in place¶
Partial writes plus process crashes produce cursed half-files. Use atomic replacement.
6. Catching broad Exception and hiding context¶
Bad: except Exception: pass, which swallows the error and every bit of context with it.
Good: catch the narrowest exception you can actually handle, log it with context, and re-raise anything you cannot.
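A sketch of the difference, with a hypothetical config-reading helper:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("infra")

def read_port(path: str) -> int:
    try:
        with open(path, encoding="utf-8") as f:
            return int(f.read().strip())
    except FileNotFoundError:
        # Expected and recoverable: fall back, but say so in the logs
        log.warning("no port file at %s, using default", path)
        return 8080
    # ValueError, PermissionError, etc. propagate with full tracebacks,
    # because this function has no sensible way to handle them

print(read_port("/nonexistent/port.txt"))  # 8080
```

The narrow `except` documents exactly which failure you planned for; everything else stays loud.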
7. Unbounded concurrency¶
Congratulations, you parallelized your outage.
8. Logging secrets¶
Do not emit bearer tokens, passwords, signed URLs, session cookies, or full cloud API payloads that contain them.
9. CSV with split(',')¶
No.
10. Building giant dict soup instead of typed boundaries¶
Loose dicts are fine at the edges. Deep inside the codebase they become a swamp.
19. Cheat sheet¶
Bash -> Python Rosetta stone¶
| Bash idea | Python equivalent |
|---|---|
| VAR=value | var = value |
| ${var} in strings | f"{var}" |
| arrays | list |
| associative arrays | dict |
| grep / awk pipelines | loops, comprehensions, Counter, re |
| exit codes only | exceptions + explicit exit codes |
| $(cmd) | subprocess.run(...) |
| heredoc templates | f-strings or Jinja2 |
| jq | json module |
| ad hoc env vars | config precedence |
Good defaults¶
Python version target: 3.11+
Run system interpreter as: python3
Run venv interpreter as: python
Paths: pathlib
HTTP: requests.Session + timeout + retries
Files: atomic replace for important writes
CLI: argparse with subcommands
Tests: pytest
Typing: annotate function boundaries first
Secrets: env vars / roles / secret store
Standard library modules worth memorizing¶
pathlib paths and filesystem work
json JSON encode/decode
csv CSV parsing/writing
configparser INI files
tomllib TOML parsing
subprocess external commands
shlex safe shell quoting for display
collections Counter, defaultdict, deque
datetime time handling
logging production logs
argparse CLI parsing
concurrent.futures simple threading/process pools
20. Drills¶
Drill 1 - Parse JSON safely¶
Write a function that accepts a JSON string containing a list of objects, filters for enabled=true, and returns hostnames.
Drill 2 - Atomic config update¶
Write a function that updates /tmp/app.conf with a rendered config string using atomic replacement.
Drill 3 - HTTP health fan-out¶
Given a list of hosts, use ThreadPoolExecutor and requests.Session to collect /health results with timeouts.
Drill 4 - Config precedence¶
Implement: defaults < file < env < CLI.
Drill 5 - Paginated AWS listing¶
List every object in an S3 prefix and count total size without loading all results into memory.
Drill 6 - Kubernetes CrashLoop detector¶
Return names of pods with any container waiting in CrashLoopBackOff, scoped to a namespace.
21. Verification notes¶
This revision intentionally corrected and updated a few areas that commonly go stale:
- guidance on python vs python3
- free-threaded Python status
- datetime.utcnow() deprecation
- atomic file writes using mkstemp() and explicit cleanup
- secure Paramiko host-key handling
- boto3 credential-provider discussion
- requests timeout semantics
- Kubernetes list-scaling guidance
- TOML in the standard library
- balanced treatment of uv
Checked against official docs on 2026-03-23¶
- PEP 394 - python command guidance
- Python docs - free-threaded Python
- Python 3.12+ docs - datetime.utcnow() deprecation
- Python docs - tempfile.mkstemp() and pathlib.Path.replace()
tempfile.mkstemp()andpathlib.Path.replace() - Requests advanced usage docs
- Boto3 credentials docs
- Paramiko client docs
- Kubernetes API concepts docs
- Ansible interpreter discovery and raw module docs
- Astral uv docs
Final opinion¶
Python is not magic. It is just the point where your automation stops pretending that strings are a database, grep is a parser, and exit code 1 is “error handling”.
For infrastructure work, that is enough.