
Scripting Rosetta — Lesson 1: Text Processing

Bundle: Bash + Python + CLI Tools + Regex
Level: L1–L2 (Foundations → Operations)
Time: 45–60 minutes
Prerequisites: Basic terminal comfort, can edit files and run commands


What You'll Learn

By the end of this lesson you'll be able to:

  • Parse log files in both bash and Python
  • Use regex the same way in grep, sed, awk, and Python's re module
  • Know when to reach for a pipeline vs. a script
  • Port a solution from one language to the other


Part 1: The Mission

You're on call. A web server is misbehaving. You have a 50,000-line access log and you need answers fast:

  1. How many requests returned 5xx errors?
  2. Which IP addresses are hammering the server?
  3. What URLs are failing?

The log looks like this:

10.0.0.42 - - [22/Mar/2026:03:14:22 +0000] "GET /api/users HTTP/1.1" 500 1024
10.0.0.17 - - [22/Mar/2026:03:14:23 +0000] "POST /api/orders HTTP/1.1" 201 512
10.0.0.42 - - [22/Mar/2026:03:14:24 +0000] "GET /health HTTP/1.1" 200 15
10.0.0.99 - - [22/Mar/2026:03:14:25 +0000] "GET /api/users HTTP/1.1" 503 0

Let's solve each question two ways — bash pipeline first, then Python — and compare.


Part 2: Count the 5xx Errors

The Bash Way

grep -cE '" [5][0-9]{2} ' access.log

One command. Done. Let's break it down:

Piece             What it does
grep -c           Count matching lines instead of printing them
-E                Extended regex (so {2} works without escaping)
'" [5][0-9]{2} '  A quote, space, a 5, two more digits, space — matches the status code field

Regex Sidebar: Character Classes

[0-9] matches any single digit. {2} means "exactly two of the previous thing." Together, [5][0-9]{2} matches 500, 501, 502, ... 599.

You could also write 5[0-9][0-9] without the quantifier — same result, more explicit.
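A quick sanity check in Python's re, which accepts the same character-class syntax, confirms the two spellings are equivalent (the sample string here is made up for illustration):

```python
import re

text = "codes: 200 503 599 499"

# {2} quantifier vs. explicit repetition: same matches either way
print(re.findall(r"5[0-9]{2}", text))    # ['503', '599']
print(re.findall(r"5[0-9][0-9]", text))  # ['503', '599']
```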

The Python Way

import re

count = 0
pattern = re.compile(r'" (5\d{2}) ')

with open("access.log") as f:
    for line in f:
        if pattern.search(line):
            count += 1

print(count)

More lines, but look what we get: a compiled regex (faster for 50k lines), line-by-line reading (constant memory), and a variable we can use later.

Side-by-Side

                 Bash                          Python
Lines of code    1                             8
Memory           Stream (low)                  Stream (low)
Regex dialect    ERE (POSIX)                   PCRE-like
Reusability      Pipe it further               Build on it
When to pick it  Quick answer at the terminal  Part of a larger script

Trivia: \d in Python regex means "any digit" — it's shorthand for [0-9]. But \d does NOT work in basic grep or sed. You need grep -P (PCRE mode) or stick with [0-9]. This is the #1 regex portability gotcha.
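You can see the equivalence, plus one subtlety, directly in Python: on str input, `\d` also matches non-ASCII Unicode digits, which `[0-9]` never does. Pass re.ASCII if you want strict `[0-9]` behavior:

```python
import re

s = "status 503, size 0"
print(re.findall(r"\d+", s))     # ['503', '0']
print(re.findall(r"[0-9]+", s))  # ['503', '0']

# Subtlety: on str patterns, \d also matches Unicode digits such as
# '٣' (Arabic-Indic digit three) unless you pass re.ASCII.
print(re.findall(r"\d", "٣"))            # ['٣']
print(re.findall(r"\d", "٣", re.ASCII))  # []
```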


Part 3: Find the Top Offending IPs

The Bash Way

grep -E '" 5[0-9]{2} ' access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10

This is the classic Unix pipeline — each tool does one job:

grep      filter to 5xx lines only
awk       extract first column (IP address)
sort      group identical lines together
uniq -c   count consecutive duplicates
sort -rn  sort by count, descending
head -10  top 10

CLI Tool Trivia: Why sort before uniq?

uniq only collapses adjacent duplicate lines. If the same IP appears on lines 1, 5, and 900, uniq sees three separate entries. sort groups them together first. This is the most common pipeline mistake beginners make.
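Python has a direct analog: itertools.groupby only merges adjacent runs, exactly like uniq, so it needs sorted input too. A minimal demonstration with a made-up IP list:

```python
from itertools import groupby

ips = ["10.0.0.42", "10.0.0.17", "10.0.0.42"]

# Unsorted: the two 10.0.0.42 entries are not adjacent, so they stay
# separate. Three groups, just like uniq without sort.
print([(ip, len(list(g))) for ip, g in groupby(ips)])

# Sorted first: identical IPs become adjacent. Two groups.
print([(ip, len(list(g))) for ip, g in groupby(sorted(ips))])
```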

The Python Way

import re
from collections import Counter

pattern = re.compile(r'" (5\d{2}) ')
ips = Counter()

with open("access.log") as f:
    for line in f:
        if pattern.search(line):
            ip = line.split()[0]
            ips[ip] += 1

for ip, count in ips.most_common(10):
    print(f"{count:>6}  {ip}")

Counter does what sort | uniq -c | sort -rn does — but in memory, in one pass, with no need to pre-sort.
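The correspondence in miniature (a toy list, not real log data):

```python
from collections import Counter

# No pre-sorting needed: Counter accumulates counts in one pass,
# and most_common() plays the role of sort -rn | head.
hits = Counter(["10.0.0.42", "10.0.0.17", "10.0.0.42", "10.0.0.99", "10.0.0.42"])
print(hits.most_common(2))  # [('10.0.0.42', 3), ('10.0.0.17', 1)]
```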

Side-by-Side

             Bash                                                 Python
Approach     Stream: filter → extract → sort → count              In-memory: scan → accumulate → rank
Memory       O(unique IPs) at the sort stage                      O(unique IPs) in the Counter
Speed        Spawns six processes that run in parallel via pipes  Single process, pure Python loops
Readability  Dense but idiomatic                                  Verbose but self-documenting

When Bash Wins: You're already in the terminal, the data fits the pattern of "filter, extract, aggregate," and you need the answer in 10 seconds.

When Python Wins: You need to do something after counting — send an alert, update a database, generate a report, or the logic has branches and error handling.


Part 4: Extract the Failing URLs

The Bash Way

grep -E '" 5[0-9]{2} ' access.log | grep -oE '"[A-Z]+ [^ ]+ HTTP' | sort | uniq -c | sort -rn | head -10

Getting ugly, isn't it? Here's a cleaner version with awk:

awk '$9 ~ /^5/ {print $7}' access.log | sort | uniq -c | sort -rn | head -10

awk knows that field $9 is the status code and $7 is the URL (in standard combined log format). The ~ operator does regex matching.

Awk Trivia: awk is a full programming language from 1977. It has variables, arrays, loops, functions, and printf. Most people only use '{print $2}' and never discover the rest. The name comes from its creators: Aho, Weinberger, and Kernighan (yes, the K in K&R C).

The Python Way

import re
from collections import Counter

log_pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)

urls = Counter()

with open("access.log") as f:
    for line in f:
        m = log_pattern.match(line)
        if m and m.group("status").startswith("5"):
            urls[m.group("url")] += 1

for url, count in urls.most_common(10):
    print(f"{count:>6}  {url}")

Named capture groups ((?P<name>...)) make the code self-documenting. Once you have the pattern, every field is available by name — no counting columns.
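groupdict() goes one step further and hands back every named field at once. Here is the same pattern applied to a single sample line:

```python
import re

log_pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)

line = '10.0.0.42 - - [22/Mar/2026:03:14:22 +0000] "GET /api/users HTTP/1.1" 500 1024'
m = log_pattern.match(line)
print(m.group("url"))  # /api/users
print(m.groupdict())   # every field by name, as a dict
```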

Side-by-Side

               Bash (awk)                                             Python
Parsing        Positional fields ($7, $9) — fragile if format changes  Named groups — resilient to format changes
Regex          awk's built-in ERE                                     Python's re (PCRE-like)
Extensibility  Hard to add "also group by method"                     Add m.group("method") to the key

Part 5: The Regex Rosetta Stone

The same patterns, three dialects. Clip this and keep it:

What you want        grep (ERE)                        sed (ERE)  awk     Python re
Match a digit        [0-9]                             [0-9]      [0-9]   \d or [0-9]
One or more digits   [0-9]+                            [0-9]+     [0-9]+  \d+
Word boundary        grep -w (whole word) or grep -P '\b'  N/A    N/A     \b
Non-greedy match     grep -P '.*?'                     N/A        N/A     .*?
Named group          grep -P '(?P<n>...)'              N/A        N/A     (?P<name>...)
Backreference        grep -P '(.)\1'                   \1         N/A     \1
IP address (simple)  [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+    same       same    \d+\.\d+\.\d+\.\d+

Key Rule: \d, \w, \s are PCRE shortcuts. They work in Python and grep -P. They do NOT work in plain grep -E, sed, or awk. Use [0-9], [a-zA-Z0-9_], [[:space:]] for portable regex.


Flashcard Check

Test yourself. Cover the answers and try to recall.

Q1: What does grep -c do?

Counts matching lines instead of printing them. (It does NOT count total matches per line — grep -o | wc -l does that.)

Q2: Python re.compile(r'" (5\d{2}) ') — what does the r prefix do?

Makes it a raw string so backslashes are literal. Without r, \d would be interpreted as an escape sequence by Python before regex sees it.

Q3: Why does sort | uniq -c require the sort first?

uniq only collapses adjacent duplicates. Without sorting, non-adjacent identical lines are counted separately.

Q4: awk '{print $1}' — what is $1?

The first whitespace-delimited field of the current line. $0 is the whole line. $NF is the last field. NF is the number of fields.

Q5: Bash set -e equivalent in Python?

There is no direct equivalent. Use try/except blocks, or pass check=True to subprocess.run() to make failed commands raise exceptions.
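For example, check=True turns a nonzero exit status into a catchable exception (using the standard false utility, which always exits nonzero):

```python
import subprocess

# "false" always exits with a nonzero status.
try:
    subprocess.run(["false"], check=True)
except subprocess.CalledProcessError as e:
    print("command failed with exit code", e.returncode)
```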

Q6: What regex matches "color" and "colour"?

colou?r — the ? makes the u optional (zero or one occurrence).

Q7: \b in Python regex matches what?

A word boundary — the position between a word character (\w) and a non-word character. \bcat\b matches "cat" but not "concatenate."
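A quick check:

```python
import re

# \b fires only where a word character meets a non-word character,
# so "cat" inside other words never matches.
print(re.findall(r"\bcat\b", "cat concatenate bobcat catalog"))  # ['cat']
```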

Q8: When should you stop using bash and switch to Python?

When you need complex data structures, real error handling, the script exceeds ~100 lines, or you need to test it. Bash is glue, not an application framework.


Exercises

Exercise 1: The Quick Count (bash)

Given this log line format, write a one-liner that counts how many unique IPs made GET requests:

10.0.0.42 - - [22/Mar/2026:03:14:22 +0000] "GET /api/users HTTP/1.1" 200 1024
Hint 1: Filter for GET first, then extract the IP, then count unique values.
Hint 2: grep 'GET' | awk '{print $1}' | sort -u | wc -l
Solution:
grep '"GET ' access.log | awk '{print $1}' | sort -u | wc -l
Note the `"GET ` pattern with the leading quote — without it you'd also match lines containing "GET" in the URL path.
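To keep the rosetta habit going, here is the same count in Python, shown against an inline sample (made up for illustration — swap the list for open("access.log") on real data):

```python
sample = [
    '10.0.0.42 - - [22/Mar/2026:03:14:22 +0000] "GET /api/users HTTP/1.1" 200 1024',
    '10.0.0.17 - - [22/Mar/2026:03:14:23 +0000] "POST /api/orders HTTP/1.1" 201 512',
    '10.0.0.42 - - [22/Mar/2026:03:14:24 +0000] "GET /health HTTP/1.1" 200 15',
]

# The set comprehension is the sort -u; len() is the wc -l.
unique_get_ips = {line.split()[0] for line in sample if '"GET ' in line}
print(len(unique_get_ips))  # 1 -- only 10.0.0.42 made GET requests
```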

Exercise 2: Port It (bash → python)

Take this bash pipeline and rewrite it as a Python script:

grep -E '" [45][0-9]{2} ' access.log | awk '{print $1, $9}' | sort | uniq -c | sort -rn | head -5

Your Python version should produce identical output.

Hint 1: Use `re` to match 4xx and 5xx status codes. Use `collections.Counter` for counting.
Hint 2: The key for the Counter should be a tuple: `(ip, status_code)`.
Solution:
import re
from collections import Counter

pattern = re.compile(r'" ([45]\d{2}) ')
pairs = Counter()

with open("access.log") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            ip = line.split()[0]
            status = m.group(1)
            pairs[(ip, status)] += 1

for (ip, status), count in pairs.most_common(5):
    print(f"{count:>6}  {ip} {status}")

Exercise 3: Port It (python → bash)

Rewrite this Python script as a bash one-liner:

import re
from collections import Counter

with open("access.log") as f:
    hours = Counter(
        re.search(r'\[([^:]+:\d{2})', line).group(1)
        for line in f
        if re.search(r'" 5\d{2} ', line)
    )

for hour, count in sorted(hours.items()):
    print(f"{hour}: {count}")

(It counts 5xx errors per hour.)

Hint 1: Extract the date/hour from the bracket field with grep -oE or awk.
Hint 2: awk can do the filtering AND the field extraction in one pass.
Solution:
awk '$9 ~ /^5/ {print substr($4, 2, 14)}' access.log | sort | uniq -c
Or more readably:
grep -E '" 5[0-9]{2} ' access.log \
  | awk -F'[\\[:]' '{print $2":"$3}' \
  | sort | uniq -c

Exercise 4: Regex Debugging

This regex is supposed to match IP addresses but has a bug. What's wrong?

\d+.\d+.\d+.\d+
Hint: The dot is special in regex.
Answer: The `.` matches ANY character, not just a literal dot, so `1234X5678Y9012Z3456` would match. Fix: escape the dots → `\d+\.\d+\.\d+\.\d+`. And remember — `\d` only works in PCRE. For grep/sed/awk portability, use `[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+`.
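A short Python check makes the bug visible:

```python
import re

bad = re.compile(r"\d+.\d+.\d+.\d+")      # unescaped dots match anything
good = re.compile(r"\d+\.\d+\.\d+\.\d+")  # literal dots only

print(bool(bad.search("1234X5678Y9012Z3456")))   # True -- the bug
print(bool(good.search("1234X5678Y9012Z3456")))  # False
print(bool(good.search("10.0.0.42")))            # True
```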

Exercise 5: The Decision (think, don't code)

For each task, decide: bash pipeline or Python script? Justify your choice.

  1. Find the 10 largest .log files under /var/log
  2. Parse a CSV, validate email formats, and insert valid rows into a database
  3. Tail a log file and send a Slack alert when "OOM" appears
  4. Rename 200 files from IMG_NNNN.jpg to 2026-03-22_NNNN.jpg
  5. Generate a weekly PDF report from three API endpoints
Answers:

  1. Bash. find /var/log -name '*.log' -type f -printf '%s %p\n' | sort -rn | head -10 — one-liner, no data structures needed.
  2. Python. CSV parsing, regex validation, database connections, error handling on each row — too much state for bash.
  3. Either, but Python is safer. Bash: tail -f log | grep --line-buffered OOM | while read; do curl ...; done. Python: easier retry logic, rate limiting, structured Slack payload.
  4. Bash. for f in IMG_*.jpg; do mv "$f" "2026-03-22_${f#IMG_}"; done — or even rename if installed. Python works too, but it's overkill.
  5. Python. Multiple HTTP requests, JSON parsing, data aggregation, PDF generation — bash would be miserable.

Cheat Sheet: Bash ↔ Python Quick Reference

Task                 Bash                                           Python
Read a file          while IFS= read -r line; do ...; done < file   with open(f) as fh: for line in fh: ...
Regex match          [[ "$s" =~ pattern ]] then ${BASH_REMATCH[1]}  m = re.search(pattern, s) then m.group(1)
Replace text         sed 's/old/new/g' file                         re.sub(r'old', 'new', text)
Split a string       IFS=',' read -ra arr <<< "$s"                  s.split(',')
Associative array    declare -A map; map[key]=val                   d = {}; d[key] = val
Sort + unique count  sort | uniq -c | sort -rn                      collections.Counter(items).most_common()
Run a command        Just type it                                   subprocess.run(["cmd", "arg"], check=True)
Error handling       set -euo pipefail                              try: ... except Exception as e: ...
Temp file            tmp=$(mktemp)                                  import tempfile; tmp = tempfile.NamedTemporaryFile()
JSON parsing         jq '.key' file.json                            import json; data = json.load(f)
HTTP request         curl -s URL                                    import requests; r = requests.get(URL)
Exit code            $? (read), exit N (set)                        sys.exit(code) (set), proc.returncode (read)

Key Takeaways

  1. Bash pipelines are fast for exploration. When you're poking at a log file and need an answer in 10 seconds, nothing beats grep | awk | sort | uniq -c.

  2. Python is better for production. Once you need error handling, tests, or anything beyond "filter → extract → count," switch to Python.

  3. Regex has dialects. \d works in Python and grep -P, but NOT in sed, awk, or plain grep -E. Use [0-9] when portability matters.

  4. The tools compose. The best answer is often a hybrid: bash to explore and prototype, Python to productionize. Or bash calling Python for the hard parts.

  5. Know both, choose deliberately. "I know bash so I'll bash it" and "I know Python so I'll Python it" are both wrong. Pick the tool that fits the task.


What's Next

  • Lesson 2: File Operations — find/xargs vs pathlib/os.walk, bulk rename, permission audits
  • Lesson 3: Process Management — &/wait/trap vs subprocess/asyncio/signal
  • Lesson 4: Data Wrangling — jq/cut/sort vs json/csv/pandas one-liners
  • Lesson 5: Error Handling — set -euo pipefail vs try/except, retry patterns in both