
Python for Ops: The Bash Expert's Bridge

  • lesson
  • python
  • bash
  • automation
  • subprocess
  • pathlib
  • requests
  • argparse
  • logging
  • regex
  • cli-tools

Topics: Python, Bash, automation, subprocess, pathlib, requests, argparse, logging, regex, CLI tools
Strategy: Parallel + build-up
Level: L1–L2 (Foundations → Operations)
Time: 75–100 minutes
Prerequisites: None (but you'll get more from this if you've lived in Bash for years)


The Mission

You have a 200-line Bash monitoring script. It checks disk usage across a fleet, parses log files for error patterns, hits a health-check API, and emails a report. It works. It has worked for three years.

Then someone asks you to add JSON parsing, retry logic with exponential backoff, parallel execution across 50 hosts, and Slack webhook integration. You stare at the script and realize: this is where Bash stops being the right tool.

Your mission: rewrite this script in Python, piece by piece, translating every Bash idiom you already know into its Python equivalent. By the end, you'll know exactly when to reach for Python and when Bash is still the right call.


Part 1: The Rosetta Stone

You already know how to think about ops problems. You think in pipes, exit codes, and text streams. Python doesn't replace that thinking — it gives you better building blocks for the same ideas.

The Mental Model Map

| Bash concept | Python equivalent | Why it's better |
| --- | --- | --- |
| Pipes (cmd1 \| cmd2) | Generators, itertools | Lazy evaluation, no subshell overhead |
| grep / grep -P | List comprehensions, re module | Full regex, returns structured data, not strings |
| awk '{print $3}' | .split(), dict access | Named fields instead of positional $N |
| $ENV_VAR | os.environ['ENV_VAR'] | .get() with defaults, type conversion |
| Exit codes ($?) | Exceptions (try/except) | Stack traces, specific error types, recovery |
| getopts / shift | argparse | Auto-generated help, type validation, subcommands |
| echo / logger | logging module | Levels, formatters, multiple outputs |
| find -name '*.log' | pathlib.glob() / rglob() | Returns Path objects, not strings |
| test -f / test -d | Path.exists(), Path.is_file() | Method calls, no bracket surprises |
| curl | requests | Sessions, retries, JSON parsing built in |
| jq | json module | Native data structures, no string wrangling |
| awk associative arrays | dict, collections.Counter | First-class data structures |
| while read line | for line in file: / sys.stdin | Memory-efficient, cleaner syntax |
| sed 's/old/new/g' | re.sub() | Capture groups as variables, not \1 |
| mktemp / trap cleanup EXIT | with tempfile: / context managers | Cleanup guaranteed, even on exceptions |
| source config.sh | import, configparser, dotenv | Namespaced, no variable pollution |

Mental Model: Bash is a text stream processor. Everything is a string. Every tool communicates via text piped between processes. Python is a data structure processor. You parse text into objects once, then work with real types — lists, dicts, integers, booleans. The moment your Bash script starts doing math on strings or building data structures with associative arrays, you've crossed the line into Python territory.
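To make that concrete, here's a minimal sketch of the parse-once pattern. The df-style line is hard-coded for illustration:

```python
# Parse text into a real data structure ONCE...
line = "/dev/sda1  50G  42G  8G  84% /"
fields = line.split()

disk = {
    "filesystem": fields[0],
    "use_pct": int(fields[4].rstrip("%")),  # an int, not the string "84%"
    "mount": fields[5],
}

# ...then every downstream step works with types, not strings
if disk["use_pct"] > 80:
    print(f"{disk['mount']} is at {disk['use_pct']}%")
```

In Bash, that comparison would be string arithmetic on the output of awk and tr; here it's an integer comparison on a field you parsed exactly once.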


Flashcard Check #1

| Question | Answer |
| --- | --- |
| What's the Python equivalent of $? (exit code checking)? | try/except blocks. Python uses exceptions instead of numeric return codes. subprocess.run(check=True) raises CalledProcessError on non-zero exit. |
| In Bash, grep -c ERROR log.txt counts matches. What's the Python equivalent? | sum(1 for line in open('log.txt') if 'ERROR' in line) — or len(re.findall(r'ERROR', text)) for regex. |
| Bash pipes are lazy (each command starts before the previous finishes). Are Python generators lazy? | Yes. A generator (x for x in items) yields one value at a time, just like a pipe. The downstream consumer pulls values on demand. |
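That last flashcard is worth seeing in code. A small sketch, with made-up log lines, chaining two generator stages like a pipe:

```python
# Each generator stage pulls one item at a time from the previous
# stage, just like commands in a pipeline. Sample lines are made up.
lines = [
    "2024-01-01 INFO start",
    "2024-01-01 ERROR disk full",
    "2024-01-02 ERROR disk full",
]

errors = (line for line in lines if "ERROR" in line)   # like: grep ERROR
dates = (line.split()[0] for line in errors)           # like: awk '{print $1}'

# Nothing has executed yet. Consuming the final stage drives the
# whole pipeline, one line at a time.
print(list(dates))  # ['2024-01-01', '2024-01-02']
```

Swap the list for a file object and the same two lines stream a multi-gigabyte log in constant memory.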

Part 2: "When Do I Stop Using Bash?"

This is the question that matters more than any syntax comparison. Here's the framework.

The Decision Line

                    BASH TERRITORY          |    PYTHON TERRITORY
                                            |
  One-liner file ops                        |    JSON/YAML/XML parsing
  Gluing 3–4 commands together              |    API calls with auth + retries
  Simple cron jobs                          |    Data structures beyond arrays
  Config file generation (heredocs)         |    Error handling with recovery
  Quick log tailing / grepping              |    Parallel execution
  Package install / service restart         |    Anything over ~100 lines
  Git hooks                                 |    CSV/database operations
  Simple file watching (inotifywait)        |    Unit tests
  Environment setup scripts                 |    Reusable libraries
                                            |

The 100-Line Rule

If your Bash script passes 100 lines, stop and ask: "Is this still glue, or is this logic?" Glue connects programs. Logic transforms data, makes decisions, handles errors. Bash is great glue. Bash is terrible logic.

War Story: A team inherited a 500-line Bash script that monitored 30 microservices. It started at 40 lines — just curl each endpoint and check the HTTP code. Then someone added JSON response parsing with jq. Then retry logic (nested loops with sleep). Then Slack notifications (more curl with JSON bodies built from string concatenation). Then parallel checks (background processes with wait). Then a config file parser.

The script had 14 instances of grep | awk | sed chains, 6 places where quoting bugs could split hostnames with spaces, and zero error handling beyond set -e (which was disabled in half the functions because of subshells). It took 3 days to add a new feature because every change broke something else.

The Python rewrite was 180 lines. It had type-checked configuration, proper exception handling, concurrent.futures for parallelism, and the requests library for HTTP. The rewrite took one afternoon. The three hardest bugs in the Bash version — a quoting issue, a subshell variable scope leak, and a race condition in the background process tracking — simply could not exist in Python.

Three Signals It's Time to Switch

  1. You're building data structures. The moment you reach for declare -A (associative arrays) or start simulating objects with naming conventions (host_1_ip, host_1_port), Python's dicts and classes will save you hours.

  2. You're parsing structured data. jq is great for one-liners. But if you're piping jq output through awk to extract fields and then back into another jq call, you're writing a bad Python script in Bash.

  3. You need error recovery, not just error detection. set -e exits on failure. That's detection. Python's try/except/finally lets you catch specific errors, retry the operation, fall back to a default, log the context, and continue. That's recovery.
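Signal 3 is the one that usually forces the switch. Here's a minimal sketch of detection-plus-recovery; the flaky fetch_status function is a hypothetical stand-in for any check that can fail transiently:

```python
import time

attempts_seen = {"n": 0}

def fetch_status(url):
    # Hypothetical stand-in for a real check: fails twice, then succeeds.
    attempts_seen["n"] += 1
    if attempts_seen["n"] < 3:
        raise ConnectionError("connection refused")
    return "healthy"

def check_with_recovery(url, attempts=3, base_delay=0.01):
    # Catch a SPECIFIC error, back off, retry, and finally fall back
    # to a default: recovery, not just the detection `set -e` gives you.
    for attempt in range(1, attempts + 1):
        try:
            return fetch_status(url)
        except ConnectionError:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    return "unknown"  # fall back instead of crashing

print(check_with_recovery("http://app.internal/health"))  # prints: healthy
```

In Bash the equivalent is a nested loop around the whole command, and a second failure mode (say, a parse error) means a second nested loop.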


Part 3: subprocess — The Escape Hatch

The first thing every Bash expert wants to know: "How do I run shell commands from Python?" This is your bridge. Use it to start, then gradually replace shell calls with native Python.

The Basics

# Bash
output=$(df -h / | tail -1 | awk '{print $5}')
echo "Disk usage: ${output}"
# Python — the subprocess way (your first step)
import subprocess

result = subprocess.run(
    ["df", "-h", "/"],
    capture_output=True,
    text=True,
    check=True,
)
# result.stdout is the full output as a string
line = result.stdout.strip().split('\n')[-1]
usage = line.split()[4]
print(f"Disk usage: {usage}")
# Python — the native way (where you want to end up)
import shutil

total, used, free = shutil.disk_usage("/")
pct = used / total * 100
print(f"Disk usage: {pct:.0f}%")

See the progression? Start with subprocess (familiar), then discover that Python has a built-in function that gives you integers instead of strings you have to parse.

The subprocess Trap

Gotcha: The most common mistake when Bash experts start writing Python is calling subprocess.run() for everything. If you find yourself writing subprocess.run(["grep", "ERROR", logfile]), stop. You're paying the cost of Python (startup time, indirection) without getting the benefit (data structures, error handling). Use subprocess for commands that have no Python equivalent — systemctl, iptables, mount. For everything else, use the native library.

When to use subprocess:

| Use subprocess | Use native Python |
| --- | --- |
| systemctl restart nginx | Parsing a file (open()) |
| iptables -L | HTTP requests (requests) |
| mount /dev/sda1 /mnt | JSON parsing (json) |
| docker ps | File operations (pathlib) |
| git log --oneline | String matching (re) |
| aws CLI (when boto3 is overkill) | Math, counting, aggregation |
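When you do shell out, let check=True turn a non-zero exit into an exception instead of checking $? by hand. A runnable sketch; it uses python itself as a stand-in for a failing command like systemctl so you can run it anywhere:

```python
import subprocess
import sys

# check=True raises CalledProcessError on non-zero exit: the Python
# analogue of `cmd || handle_failure`. The inner `sys.exit(3)` is a
# stand-in for any command that fails.
try:
    subprocess.run(
        [sys.executable, "-c", "import sys; sys.exit(3)"],
        check=True,
    )
except subprocess.CalledProcessError as e:
    print(f"command failed with exit code {e.returncode}")  # 3
```

Unlike `$?`, the exception carries the command, the return code, and (with capture_output=True) the stderr, so your error logs explain themselves.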

shell=True: The Footgun

# NEVER DO THIS with user input
hostname = input("Enter hostname: ")
subprocess.run(f"ping -c 1 {hostname}", shell=True)  # Shell injection!

# What if hostname is "google.com; rm -rf /"?
# The shell interprets the semicolon as a command separator.

# SAFE: pass arguments as a list
subprocess.run(["ping", "-c", "1", hostname])  # No shell, no injection

The rule is simple: never use shell=True unless you need shell features (pipes, redirects, globbing) AND you control all the input.

# When you genuinely need a pipe, do it in Python
import subprocess

# Instead of: ps aux | grep python | grep -v grep
ps = subprocess.run(["ps", "aux"], capture_output=True, text=True)
python_procs = [
    line for line in ps.stdout.splitlines()
    if "python" in line and "grep" not in line
]

Part 4: pathlib — Files Without the Pain

In Bash, file operations are a patchwork of test, find, dirname, basename, readlink, and string manipulation. In Python, pathlib unifies all of it.

Side-by-Side: Common File Operations

# Bash: check if file exists and is readable
if [[ -f "/etc/nginx/nginx.conf" && -r "/etc/nginx/nginx.conf" ]]; then
    echo "Config exists and is readable"
fi

# Bash: get directory and filename
dir=$(dirname "/var/log/app/error.log")
name=$(basename "/var/log/app/error.log")
ext="${name##*.}"
stem="${name%.*}"
# Python: same operations, but they return useful objects
import os
from pathlib import Path

config = Path("/etc/nginx/nginx.conf")
if config.is_file() and os.access(config, os.R_OK):
    print("Config exists and is readable")

log = Path("/var/log/app/error.log")
log.parent       # PosixPath('/var/log/app')   — like dirname
log.name         # 'error.log'                  — like basename
log.suffix       # '.log'                       — the extension
log.stem         # 'error'                      — name without extension

Side-by-Side: Find Files

# Bash: find all .log files over 100MB, modified in last 7 days
find /var/log -name "*.log" -size +100M -mtime -7
# Python: same thing, but you get Path objects you can work with
from pathlib import Path
import time

seven_days_ago = time.time() - (7 * 86400)

large_recent_logs = [
    p for p in Path("/var/log").rglob("*.log")
    if p.stat().st_size > 100 * 1024 * 1024
    and p.stat().st_mtime > seven_days_ago
]

# Now you can DO things with them — sort, sum, group by directory
total_mb = sum(p.stat().st_size for p in large_recent_logs) / (1024 * 1024)
print(f"Found {len(large_recent_logs)} files, {total_mb:.0f} MB total")

The / Operator

The single most satisfying thing about pathlib: building paths with /.

# No more os.path.join(os.path.join(base, "subdir"), "file.txt")
base = Path("/etc/myapp")
config = base / "conf.d" / "upstream.yaml"     # PosixPath('/etc/myapp/conf.d/upstream.yaml')
backup = config.with_suffix(".yaml.bak")        # PosixPath('/etc/myapp/conf.d/upstream.yaml.bak')

Trivia: The / operator for paths was added in Python 3.4 (2014) via PEP 428. It works by overriding __truediv__, the same dunder method that handles a / b for numbers. The pathlib module was written by Antoine Pitrou, who also rewrote Python's I/O stack. Before pathlib, the standard approach was os.path.join() — which returns a string, not a path object, so you lose all the methods and have to keep calling os.path functions.
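You can see the mechanism in a toy class. This is purely illustrative, not how pathlib is actually implemented:

```python
# Overriding __truediv__ makes the / operator call your method.
class ToyPath:
    def __init__(self, parts):
        self.parts = parts

    def __truediv__(self, other):
        # a / b  ==>  a.__truediv__(b)
        return ToyPath(self.parts + [other])

    def __str__(self):
        return "/".join(self.parts)

p = ToyPath(["", "etc", "myapp"]) / "conf.d" / "upstream.yaml"
print(p)  # /etc/myapp/conf.d/upstream.yaml
```

Same dunder-method trick that lets 6 / 2 work on integers, repurposed for joining path segments.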


Flashcard Check #2

| Question | Answer |
| --- | --- |
| What does Path("/var/log").rglob("*.log") do? | Recursively finds all .log files under /var/log, like find /var/log -name "*.log". Returns a generator of Path objects. |
| Why should you avoid subprocess.run("cmd", shell=True) with user input? | Shell injection. The shell interprets special characters (;, \|, $()) in the input string. Pass arguments as a list instead. |
| What's the Python equivalent of dirname and basename? | Path.parent and Path.name. Also Path.stem (name without extension) and Path.suffix (extension). |

Part 5: Replacing Your Bash Toolkit

Let's go tool by tool through the things you use daily in Bash and show the Python equivalent. Not theory — real ops tasks.

curl → requests

# Bash: check a health endpoint with timeout and retry
for i in 1 2 3; do
    response=$(curl -s -o /dev/null -w "%{http_code}" \
        --connect-timeout 5 --max-time 30 \
        "http://app.internal:8080/health")
    [[ "$response" == "200" ]] && break
    sleep $((i * 2))
done
# Python: same thing, but you get JSON, headers, and proper error types
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(total=3, backoff_factor=2, status_forcelist=[500, 502, 503, 504])
session.mount("http://", HTTPAdapter(max_retries=retry))

response = session.get("http://app.internal:8080/health", timeout=(5, 30))
response.raise_for_status()  # Raises HTTPError for 4xx/5xx

# The response is an object, not a string
print(response.status_code)     # 200
print(response.json())          # {'status': 'healthy', 'uptime': 84600}
print(response.headers['content-type'])  # 'application/json'

The retry loop that took seven lines of Bash (and has a bug: that sleep calculation has no jitter) becomes two lines of Python that handle exponential backoff correctly.

jq → json

# Bash: extract a nested field from API response
curl -s http://api.internal/status | jq -r '.services[] | select(.healthy == false) | .name'
# Python: same query, but you can do anything with the result
import json
import requests

data = requests.get("http://api.internal/status", timeout=10).json()
unhealthy = [svc["name"] for svc in data["services"] if not svc["healthy"]]
# unhealthy is now a Python list: ['redis', 'worker-3']
# You can count, sort, deduplicate, pass to another function...

awk -F → csv / split / dict

# Bash: parse /etc/passwd, count shells
awk -F: '{shells[$7]++} END {for (s in shells) print shells[s], s}' /etc/passwd | sort -rn
# Python: same analysis, but readable
from collections import Counter
from pathlib import Path

shells = Counter()
for line in Path("/etc/passwd").read_text().splitlines():
    fields = line.split(":")
    shells[fields[6]] += 1

for shell, count in shells.most_common():
    print(f"{count:4d} {shell}")

Under the Hood: collections.Counter is a dict subclass optimized for counting. .most_common() returns items sorted by frequency using a heap sort — O(n log k) where k is the number of results requested. Awk's associative arrays are essentially the same data structure (hash maps), but Python gives you methods like .most_common(), .subtract(), and arithmetic: counter_a + counter_b merges counts.
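A quick sketch of that Counter arithmetic, with made-up shell counts:

```python
from collections import Counter

# Two hypothetical snapshots of shell usage
yesterday = Counter({"/bin/bash": 10, "/usr/sbin/nologin": 30})
today = Counter({"/bin/bash": 12, "/usr/sbin/nologin": 30, "/bin/zsh": 1})

combined = yesterday + today   # merge counts across both snapshots

delta = today.copy()
delta.subtract(yesterday)      # in-place difference (entries can go negative)

print(combined["/bin/bash"])   # 22
print(delta["/bin/bash"])      # 2  (two new bash users since yesterday)
print(today.most_common(1))    # [('/usr/sbin/nologin', 30)]
```

Doing the same diff in awk means writing the merge loop yourself.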

grep -P → re (regex)

# Bash: extract IPs from a log, count occurrences
grep -oP '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}' /var/log/auth.log | sort | uniq -c | sort -rn | head -10
# Python: same thing, no pipes needed
import re
from collections import Counter
from pathlib import Path

ip_pattern = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
ips = Counter()
for line in open("/var/log/auth.log"):
    for ip in ip_pattern.findall(line):
        ips[ip] += 1

for ip, count in ips.most_common(10):
    print(f"{count:6d} {ip}")

The Bash version spawns 5 processes. The Python version is a single process that streams the file line by line. For a 10GB auth log, the Python version uses constant memory.

getopts → argparse

# Bash: argument parsing
usage() { echo "Usage: $0 [-n] [-t TIMEOUT] [-o FORMAT] HOST_PATTERN" >&2; exit 1; }
DRY_RUN=false; TIMEOUT=10; FORMAT=table
while getopts ":nt:o:" opt; do
    case ${opt} in
        n) DRY_RUN=true ;; t) TIMEOUT=${OPTARG} ;; o) FORMAT=${OPTARG} ;;
        :) echo "Option -${OPTARG} requires an argument" >&2; usage ;;
        \?) echo "Unknown option -${OPTARG}" >&2; usage ;;
    esac
done
shift $((OPTIND - 1)); [[ $# -lt 1 ]] && usage
HOST_PATTERN=$1
# Python: same interface, but with free help text, type checking, and validation
import argparse

parser = argparse.ArgumentParser(description="Fleet monitoring tool")
parser.add_argument("host_pattern", help="Hostname pattern to match (e.g., 'web-*')")
parser.add_argument("-n", "--dry-run", action="store_true", help="Show what would happen")
parser.add_argument("-t", "--timeout", type=int, default=10, help="SSH timeout in seconds")
parser.add_argument("-o", "--output", choices=["table", "json", "csv"], default="table",
                    help="Output format")

args = parser.parse_args()
# args.dry_run is a bool, args.timeout is an int (not a string!)
# Run with --help and you get a formatted help page for free

Running python fleet_monitor.py --help produces:

usage: fleet_monitor.py [-h] [-n] [-t TIMEOUT] [-o {table,json,csv}] host_pattern

Fleet monitoring tool

positional arguments:
  host_pattern          Hostname pattern to match (e.g., 'web-*')

optional arguments:
  -h, --help            show this help message and exit
  -n, --dry-run         Show what would happen
  -t TIMEOUT, --timeout TIMEOUT
                        SSH timeout in seconds
  -o {table,json,csv}, --output {table,json,csv}
                        Output format

You get that for free. In Bash, you'd write the usage() function by hand and keep it in sync with the actual options (and you never do, and the help text is always out of date).

echo/logger → logging

# Bash: structured logging
log() {
    local level=$1; shift
    printf '%s [%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$level" "$*" >&2
}
log INFO "Starting fleet check"
log ERROR "Host web-03 unreachable"
# Python: same idea, but levels are enforced and output is configurable
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%SZ",
)
log = logging.getLogger(__name__)

log.info("Starting fleet check")
log.error("Host web-03 unreachable")
log.debug("Response headers: %s", headers)  # Only prints if level=DEBUG

The Python version gives you log levels that actually filter (DEBUG messages disappear in production), the ability to add file handlers, JSON formatters, and syslog output — all without changing the log calls.
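For example, adding a file handler that captures DEBUG while the console stays at INFO, without touching a single log call. The /tmp path is illustrative:

```python
import logging

log = logging.getLogger("fleet")
log.setLevel(logging.DEBUG)

# Console handler: INFO and above
console = logging.StreamHandler()
console.setLevel(logging.INFO)
log.addHandler(console)

# File handler: everything, including DEBUG (path is illustrative)
file_handler = logging.FileHandler("/tmp/fleet.log", mode="w")
file_handler.setLevel(logging.DEBUG)
file_handler.setFormatter(
    logging.Formatter("%(asctime)s [%(levelname)s] %(message)s")
)
log.addHandler(file_handler)

log.debug("only in the file")
log.info("on the console and in the file")
```

The Bash equivalent is a log() function that tees to a file and greps levels by hand, and it grows a special case for every new destination.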

while read → sys.stdin

# Bash: process a pipe
cat access.log | while IFS= read -r line; do
    status=$(echo "$line" | awk '{print $9}')
    [[ "$status" =~ ^5 ]] && echo "$line"
done
# Python: same thing, from a pipe or a file
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) >= 9 and fields[8].startswith("5"):
        print(line, end="")

Run it: cat access.log | python3 filter_5xx.py

This works because sys.stdin is iterable, just like a file object. Python reads line by line, never loading the whole file into memory.
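If you want the script to accept file arguments and fall back to stdin, the way grep and awk do, the stdlib fileinput module handles both. A sketch of the same 5xx filter as a reusable function (the name filter_5xx is ours):

```python
import fileinput

def filter_5xx(files):
    """Return log lines whose 9th field is a 5xx status.

    `files` is a list of paths; "-" means stdin, so callers can pass
    `sys.argv[1:] or ["-"]` for awk-style behavior.
    """
    hits = []
    with fileinput.input(files=files) as stream:
        for line in stream:
            fields = line.split()
            if len(fields) >= 9 and fields[8].startswith("5"):
                hits.append(line)
    return hits
```

Called as `filter_5xx(sys.argv[1:] or ["-"])`, it reads the files named on the command line or falls back to the pipe when there are none.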


Flashcard Check #3

| Question | Answer |
| --- | --- |
| What requests feature replaces the Bash for i in 1 2 3; do curl ...; sleep retry pattern? | urllib3.util.retry.Retry mounted on a requests.Session via HTTPAdapter. Handles backoff, status code filtering, and connection errors automatically. |
| In Bash, awk -F: '{print $7}' /etc/passwd gets the 7th field. Python equivalent? | line.split(":")[6] — zero-indexed, so field 7 is index 6. Or use the csv module: csv.reader(file, delimiter=":"). |
| Why is Counter.most_common(10) better than sort \| uniq -c \| sort -rn \| head -10? | Single process, O(n log k) heap sort vs. spawning 4 processes and sorting the entire dataset. Also returns tuples you can iterate programmatically. |

Part 6: Virtual Environments — Python's Missing Piece

Bash doesn't have a dependency problem because Bash doesn't have dependencies. Python does. This is the one concept with no Bash equivalent, and it trips up every Bash expert who starts writing Python.

The Problem

# This installs requests globally — for ALL Python scripts on this machine
pip3 install requests==2.28.0

# Six months later, another script needs requests>=2.31
pip3 install requests==2.31.0

# Your first script now breaks because the API changed between versions

The Solution: venv

# Create a virtual environment (one-time setup)
python3 -m venv ~/.venvs/fleet-monitor

# Activate it (changes which python3 and pip3 you're using)
source ~/.venvs/fleet-monitor/bin/activate

# Now pip installs go into this venv only
pip install requests==2.31.0

# Freeze exact versions for reproducibility
pip freeze > requirements.txt

# Deactivate when done
deactivate
# Reproduce the environment on another machine
python3 -m venv ~/.venvs/fleet-monitor
source ~/.venvs/fleet-monitor/bin/activate
pip install -r requirements.txt

Trivia: virtualenv was created by Ian Bicking in 2007. The idea that each project should have isolated dependencies was controversial — sysadmins argued that system-wide packages were simpler. Python 3.3 added venv to the standard library in 2012, officially endorsing the pattern. The name "pip" is a recursive acronym: "pip installs packages." pip itself didn't exist until 2008 — before that, you used easy_install, which couldn't even uninstall packages.

Remember: "One project, one venv." Make it a reflex. The first three commands for any new Python project: mkdir project && cd project && python3 -m venv .venv && source .venv/bin/activate.


Part 7: The Rewrite — A Real Monitoring Script

Here's the mission payoff. We'll build a fleet monitoring script, starting with the Bash version you already understand, then rewriting it in Python section by section.

The Bash Version (abbreviated)

#!/usr/bin/env bash
set -euo pipefail

HOSTS_FILE="/etc/fleet/hosts.txt"
THRESHOLD=90
TIMEOUT=10
REPORT=""

# Check disk usage on each host
while IFS= read -r host; do
    [[ -z "$host" || "$host" == \#* ]] && continue
    output=$(ssh -o ConnectTimeout="$TIMEOUT" "$host" "df -h / | tail -1" 2>/dev/null) || {
        REPORT+="FAIL: $host (unreachable)\n"
        continue
    }
    usage=$(echo "$output" | awk '{print $5}' | tr -d '%')
    if (( usage > THRESHOLD )); then
        REPORT+="WARN: $host disk at ${usage}%\n"
    fi
done < "$HOSTS_FILE"

# Check health endpoint
status=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
    "http://app.internal:8080/health")
if [[ "$status" != "200" ]]; then
    REPORT+="CRIT: Health check returned $status\n"
fi

# Parse recent errors from log
error_count=$(grep -c "ERROR" /var/log/app/app.log 2>/dev/null || echo 0)
if (( error_count > 100 )); then
    REPORT+="WARN: $error_count errors in app.log\n"
fi

# Send report (if anything to report)
if [[ -n "$REPORT" ]]; then
    echo -e "$REPORT" | mail -s "Fleet Report $(date +%F)" ops@example.com
fi

This works. But adding Slack webhooks, JSON formatting, parallel checks, and retry logic to this script would double its size and triple its bug surface.

The Python Version

#!/usr/bin/env python3
"""Fleet monitoring script — checks disk, health, and logs."""

import argparse
import json
import logging
import subprocess
import sys
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# --- Configuration ---

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%SZ",
)
log = logging.getLogger("fleet-monitor")


def get_http_session(retries=3, backoff=1.0):
    """Create a requests session with retry logic."""
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=backoff,
        status_forcelist=[500, 502, 503, 504],
    )
    session.mount("http://", HTTPAdapter(max_retries=retry))
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session


# --- Checks ---

def check_disk(host, threshold=90, timeout=10):
    """Check disk usage on a remote host via SSH."""
    try:
        result = subprocess.run(
            ["ssh", "-o", f"ConnectTimeout={timeout}", host, "df", "-h", "/"],
            capture_output=True, text=True, timeout=timeout + 5,
        )
        if result.returncode != 0:
            return {"host": host, "status": "FAIL", "detail": "unreachable"}

        # Parse the output — last line, 5th field, strip the %
        line = result.stdout.strip().split("\n")[-1]
        usage = int(line.split()[4].rstrip("%"))

        if usage > threshold:
            return {"host": host, "status": "WARN", "detail": f"disk at {usage}%"}
        return {"host": host, "status": "OK", "detail": f"disk at {usage}%"}

    except subprocess.TimeoutExpired:
        return {"host": host, "status": "FAIL", "detail": "SSH timeout"}
    except (IndexError, ValueError) as e:
        return {"host": host, "status": "FAIL", "detail": f"parse error: {e}"}


def check_health(url, session):
    """Check an HTTP health endpoint."""
    try:
        resp = session.get(url, timeout=(5, 30))
        if resp.ok:
            return {"check": "health", "status": "OK", "detail": f"HTTP {resp.status_code}"}
        return {"check": "health", "status": "CRIT", "detail": f"HTTP {resp.status_code}"}
    except requests.RequestException as e:
        return {"check": "health", "status": "CRIT", "detail": str(e)}


def check_log_errors(log_path, threshold=100):
    """Count ERROR lines in a log file."""
    path = Path(log_path)
    if not path.exists():
        return {"check": "log_errors", "status": "WARN", "detail": f"{log_path} not found"}

    count = sum(1 for line in open(path) if "ERROR" in line)

    if count > threshold:
        return {"check": "log_errors", "status": "WARN", "detail": f"{count} errors"}
    return {"check": "log_errors", "status": "OK", "detail": f"{count} errors"}


# --- Reporting ---

def send_slack(webhook_url, results, session):
    """Send results to a Slack webhook."""
    problems = [r for r in results if r.get("status") != "OK"]
    if not problems:
        return

    blocks = [f"*{r.get('host', r.get('check'))}*: {r['status']} {r['detail']}"
              for r in problems]
    payload = {"text": f"Fleet Report: {len(problems)} issues\n" + "\n".join(blocks)}
    session.post(webhook_url, json=payload, timeout=10)


# --- Main ---

def main():
    parser = argparse.ArgumentParser(description="Fleet monitoring")
    parser.add_argument("--hosts", default="/etc/fleet/hosts.txt", help="Hosts file")
    parser.add_argument("--health-url", default="http://app.internal:8080/health")
    parser.add_argument("--log-path", default="/var/log/app/app.log")
    parser.add_argument("--threshold", type=int, default=90, help="Disk usage threshold %%")
    parser.add_argument("--slack-webhook", help="Slack webhook URL")
    parser.add_argument("--workers", type=int, default=20, help="Parallel SSH workers")
    parser.add_argument("--output", choices=["text", "json"], default="text")
    args = parser.parse_args()

    session = get_http_session()
    results = []

    # Load hosts (skip blanks and comments, just like the Bash version)
    hosts_path = Path(args.hosts)
    if hosts_path.exists():
        hosts = [
            line.strip() for line in hosts_path.read_text().splitlines()
            if line.strip() and not line.strip().startswith("#")
        ]
    else:
        log.error("Hosts file not found: %s", args.hosts)
        sys.exit(1)

    # Parallel disk checks (this is where Python shines)
    log.info("Checking %d hosts with %d workers", len(hosts), args.workers)
    with ThreadPoolExecutor(max_workers=args.workers) as pool:
        futures = {pool.submit(check_disk, host, args.threshold): host for host in hosts}
        for future in as_completed(futures):
            results.append(future.result())

    # Health check
    results.append(check_health(args.health_url, session))

    # Log check
    results.append(check_log_errors(args.log_path))

    # Output
    problems = [r for r in results if r.get("status") != "OK"]

    if args.output == "json":
        print(json.dumps(results, indent=2))
    else:
        for r in results:
            label = r.get("host", r.get("check", "unknown"))
            print(f"[{r['status']:4s}] {label}: {r['detail']}")

    # Slack notification
    if args.slack_webhook and problems:
        send_slack(args.slack_webhook, results, session)
        log.info("Slack notification sent (%d issues)", len(problems))

    # Exit code: 0 if all OK, 1 if any problems
    sys.exit(1 if problems else 0)


if __name__ == "__main__":
    main()

What Changed

| Concern | Bash version | Python version |
| --- | --- | --- |
| Parallelism | Sequential SSH (slow) | ThreadPoolExecutor with 20 workers |
| Error handling | \|\| continue (skip and forget) | try/except with specific error types, context preserved |
| Output format | String concatenation | Structured dicts, JSON or text output |
| Retry logic | Not present | Built into requests.Session |
| Argument parsing | Would need hand-written getopts | argparse with types, defaults, help |
| Notifications | mail (one channel) | Slack webhook (extensible to any HTTP API) |
| Exit code | Implicit | Explicit: 0 for clean, 1 for problems |

Part 8: CSV, TSV, and Structured Data

Bash engineers parse structured data with awk -F. Python has proper tools.

# Bash: sum bytes from a TSV of transfer logs
awk -F'\t' '{sum += $3} END {print sum}' transfers.tsv
# Python: same, but with headers and type safety
import csv
from pathlib import Path

total = 0
with open("transfers.tsv") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        total += int(row["bytes"])

print(f"Total bytes: {total:,}")

The DictReader gives you named columns instead of positional $3. When someone reorders the TSV columns, the Python version still works. The Bash version silently sums the wrong column.
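The writing side works the same way: csv.DictWriter emits columns by name, so reordering fieldnames never misaligns the data. A sketch with made-up rows:

```python
import csv
import sys

# Hypothetical per-host transfer totals
rows = [
    {"host": "web-01", "bytes": 10485760},
    {"host": "web-02", "bytes": 524288},
]

# Columns are matched by dict key, not position
writer = csv.DictWriter(sys.stdout, fieldnames=["host", "bytes"], delimiter="\t")
writer.writeheader()
for row in rows:
    writer.writerow(row)
```

The header row comes for free, and quoting of awkward values (tabs, quotes inside fields) is handled by the csv module instead of by hand-rolled printf.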


Part 9: The re Module — Regex Without the Quoting Nightmares

Bash regex lives in grep -P, sed, [[ $str =~ pattern ]], and awk. Each has different escaping rules. Python has one: re.

import re

# Compile patterns you'll reuse (faster in loops)
log_pattern = re.compile(
    r'(?P<ip>\d+\.\d+\.\d+\.\d+) .+ \[(?P<date>[^\]]+)\] "(?P<method>\w+) (?P<path>[^ ]+)'
)

for line in open("/var/log/nginx/access.log"):
    match = log_pattern.search(line)
    if match:
        ip = match.group("ip")         # Named groups — no more \1, \2
        path = match.group("path")
        method = match.group("method")

Remember: Named capture groups (?P<name>...) are Python's gift to regex. Instead of counting parentheses to figure out if the IP is \1 or \3, you just say match.group("ip"). This alone is worth switching from grep -P for anything complex.

# re.sub replaces sed for substitution
cleaned = re.sub(r'\x1b\[[0-9;]*m', '', ansi_text)   # Strip ANSI color codes
normalized = re.sub(r'\s+', ' ', messy_text).strip()   # Collapse whitespace
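One trick sed has no answer for: re.sub accepts a function as the replacement, so the substitution can compute its output. A small sketch (the mask_ip helper is ours) that masks the last octet of every IP in a line:

```python
import re

def mask_ip(match):
    # The replacement is computed per match, not a fixed string
    a, b, c, _ = match.group().split(".")
    return f"{a}.{b}.{c}.x"

line = "Failed login from 203.0.113.42 and 198.51.100.7"
masked = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", mask_ip, line)
print(masked)  # Failed login from 203.0.113.x and 198.51.100.x
```

Anything you can express in Python (lookups, arithmetic, API calls) can drive the replacement, where sed limits you to backreferences.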

Part 10: Putting It All Together — Real Tasks

Task: Parse a Log and Generate a Report

#!/usr/bin/env python3
"""Parse nginx access log, generate a status code summary."""

import re
import sys
from collections import Counter
from datetime import datetime

status_counts = Counter()
slow_requests = []

pattern = re.compile(r'"(?:GET|POST|PUT|DELETE) ([^ ]+) HTTP/\d\.\d" (\d{3}) .+ (\d+\.\d+)$')

for line in sys.stdin:
    match = pattern.search(line)
    if not match:
        continue
    path, status, response_time = match.group(1), match.group(2), float(match.group(3))
    status_counts[status] += 1
    if response_time > 2.0:
        slow_requests.append((response_time, path))

print("=== Status Code Summary ===")
for code, count in status_counts.most_common():
    print(f"  {code}: {count:,}")

print(f"\n=== Slow Requests (>{2.0}s): {len(slow_requests)} ===")
for time_s, path in sorted(slow_requests, reverse=True)[:10]:
    print(f"  {time_s:.2f}s  {path}")

Run it: cat /var/log/nginx/access.log | python3 log_report.py

It reads from stdin (like any good Unix tool), streams line by line, and produces a human-readable report. A Bash version doing the same thing would be 3x longer and require multiple awk passes.

Task: Manage Old Files

#!/usr/bin/env python3
"""Clean up files older than N days, with dry-run support."""

import argparse
import time
from pathlib import Path

parser = argparse.ArgumentParser(description="Clean old files")
parser.add_argument("directory", type=Path)
parser.add_argument("--days", type=int, default=30, help="Delete files older than N days")
parser.add_argument("--pattern", default="*.log", help="Glob pattern")
parser.add_argument("-n", "--dry-run", action="store_true")
args = parser.parse_args()

cutoff = time.time() - (args.days * 86400)
total_freed = 0

for path in args.directory.rglob(args.pattern):
    if not path.is_file():
        continue                      # rglob can match directories too
    stat = path.stat()                # stat once; reuse mtime and size
    if stat.st_mtime < cutoff:
        size = stat.st_size
        if args.dry_run:
            print(f"Would delete: {path} ({size / 1024 / 1024:.1f} MB)")
        else:
            path.unlink()
            print(f"Deleted: {path} ({size / 1024 / 1024:.1f} MB)")
        total_freed += size

print(f"\n{'Would free' if args.dry_run else 'Freed'}: {total_freed / 1024 / 1024:.1f} MB")

Run it: python3 clean_old.py /var/log --days 7 --pattern "*.log.gz" --dry-run


Exercises

Exercise 1: Translate a One-Liner (2 minutes)

Translate this Bash one-liner to Python:

cat /etc/passwd | grep -v '^#' | cut -d: -f1,7 | sort
Solution
for line in sorted(open("/etc/passwd")):
    if not line.startswith("#"):
        fields = line.strip().split(":")
        print(f"{fields[0]}:{fields[6]}")

Exercise 2: Build a Health Checker (10 minutes)

Write a Python script that:

1. Reads a list of URLs from a file (one per line)
2. Checks each URL with a 5-second timeout
3. Prints [OK] or [FAIL] for each
4. Exits with code 1 if any failed

Use requests, argparse, and sys.exit().

Hint Start with `argparse` for the file argument, read lines with `Path.read_text().splitlines()`, and use a list comprehension to collect results.
Solution
#!/usr/bin/env python3
import argparse
import sys
from pathlib import Path
import requests

parser = argparse.ArgumentParser()
parser.add_argument("urls_file", type=Path)
parser.add_argument("--timeout", type=int, default=5)
args = parser.parse_args()

urls = [u.strip() for u in args.urls_file.read_text().splitlines() if u.strip()]
failed = False

for url in urls:
    try:
        resp = requests.get(url, timeout=args.timeout)
        resp.raise_for_status()
        print(f"[  OK] {url}")
    except requests.RequestException as e:
        print(f"[FAIL] {url}: {e}")
        failed = True

sys.exit(1 if failed else 0)

Exercise 3: The Rewrite Challenge (30 minutes)

Take a Bash script you've written (or the one below) and rewrite it in Python. The goal isn't to make it shorter — it's to make it more maintainable.

#!/usr/bin/env bash
# Count unique user agents from nginx access log, top 20
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

Your Python version should:

- Use collections.Counter
- Handle the case where the log file doesn't exist
- Accept the file path and count as command-line arguments
- Output in both text and JSON formats

Solution
#!/usr/bin/env python3
import argparse
import json
import re
import sys
from collections import Counter
from pathlib import Path

parser = argparse.ArgumentParser(description="Top user agents from nginx access log")
parser.add_argument("logfile", type=Path, help="Path to access log")
parser.add_argument("-n", "--top", type=int, default=20, help="Number of results")
parser.add_argument("-o", "--output", choices=["text", "json"], default="text")
args = parser.parse_args()

if not args.logfile.exists():
    print(f"Error: {args.logfile} not found", file=sys.stderr)
    sys.exit(1)

agents = Counter()
ua_pattern = re.compile(r'"([^"]*)"\s*$')   # user agent is the last quoted field

for line in open(args.logfile):
    match = ua_pattern.search(line)
    if match:
        agents[match.group(1)] += 1

if args.output == "json":
    print(json.dumps(dict(agents.most_common(args.top)), indent=2))
else:
    for agent, count in agents.most_common(args.top):
        print(f"{count:8d}  {agent}")

Cheat Sheet

Pin this to your wall. The Bash command you'd reach for, and the Python you should reach for instead.

| I want to... | Bash | Python |
| --- | --- | --- |
| Read a file line by line | while IFS= read -r line | for line in open(path): |
| Find files by pattern | find /path -name "*.log" | Path("/path").rglob("*.log") |
| Check if file exists | [[ -f "$path" ]] | Path(path).is_file() |
| Get file size | stat -c %s "$file" | Path(file).stat().st_size |
| Join paths | "${dir}/${file}" | Path(dir) / file |
| HTTP GET with timeout | curl -s --max-time 10 "$url" | requests.get(url, timeout=10) |
| Parse JSON | jq '.key' | json.loads(text) / resp.json() |
| Count occurrences | sort \| uniq -c | collections.Counter() |
| Regex extract | grep -oP 'pattern' | re.findall(r'pattern', text) |
| Regex replace | sed 's/old/new/g' | re.sub(r'old', 'new', text) |
| Run a command | $(command) | subprocess.run(["cmd"], capture_output=True) |
| Parse CLI args | getopts | argparse.ArgumentParser() |
| Log with levels | echo "..." >&2 | logging.info("...") |
| Read env var | $VAR / ${VAR:-default} | os.environ.get("VAR", "default") |
| Parallel execution | cmd & ; wait | concurrent.futures.ThreadPoolExecutor |
| Temp file + cleanup | mktemp + trap cleanup EXIT | with tempfile.NamedTemporaryFile(): |
| Read CSV/TSV | awk -F'\t' | csv.reader(f, delimiter="\t") |
| Create a venv | N/A (no Bash equivalent) | python3 -m venv .venv |
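The parallel-execution row is the one this lesson hasn't shown in code. A minimal sketch of the cmd & ... wait pattern with a bounded worker pool (the host names and the check function are made up for the demo — in a real fleet the check would shell out to ssh or hit an HTTP endpoint):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

hosts = [f"web{i}" for i in range(1, 6)]   # hypothetical inventory

def check(host: str) -> str:
    time.sleep(0.1)          # stand-in for ssh / HTTP latency
    return f"{host}: ok"

# Bash: for h in "${hosts[@]}"; do check "$h" & done; wait
# Python: submit everything, then consume results as they finish.
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = {pool.submit(check, h): h for h in hosts}
    for future in as_completed(futures):
        print(future.result())
```

Unlike bare & and wait, the pool caps concurrency (max_workers), returns each job's result as a value, and re-raises any exception when you call future.result() — so a failed host check surfaces as a Python exception, not a silently lost background job.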

Takeaways

  1. Bash is for glue, Python is for logic. If your script is connecting programs, use Bash. If it's transforming data, handling errors, or calling APIs, use Python.

  2. The 100-line rule is real. When a Bash script passes 100 lines, every new feature adds disproportionate complexity. The same script in Python scales linearly.

  3. subprocess is a bridge, not a destination. Start by shelling out. Then replace each subprocess.run(["grep", ...]) with native Python. Your script gets faster and safer each time.

  4. pathlib replaces a dozen Bash idioms. Path objects give you file existence checks, glob, parent/child navigation, reading, and writing — all as method calls on one object.

  5. One project, one venv. Virtual environments are the one concept with no Bash equivalent. Make python3 -m venv .venv a reflex for every new project.

  6. You already know the hard part. Pipes are generators. Exit codes are exceptions. awk fields are split(). The mental models transfer — Python just gives you better building blocks for the same thinking.
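Takeaway 6, sketched in code: the pipeline grep ERROR | awk '{print $6}' | sort -u as chained generators, over made-up sample lines:

```python
log_lines = [
    "2024-01-01 ERROR disk full on /dev/sda1",
    "2024-01-01 INFO backup complete",
    "2024-01-02 ERROR disk full on /dev/sdb1",
]

# grep ERROR -> a lazy generator, like the first stage of a pipe
errors = (line for line in log_lines if "ERROR" in line)

# awk '{print $6}' -> another generator stage; nothing has run yet
devices = (line.split()[5] for line in errors)

# sort -u -> the pipeline only executes when something consumes it
print(sorted(set(devices)))
```

Each stage pulls one line at a time from the stage before it — the same streaming behavior a shell pipe gives you, without spawning a process per stage.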