Scripting Rosetta — Lesson 1: Text Processing¶
Bundle: Bash + Python + CLI Tools + Regex
Level: L1–L2 (Foundations → Operations)
Time: 45–60 minutes
Prerequisites: Basic terminal comfort, can edit files and run commands
What You'll Learn¶
By the end of this lesson you'll be able to:
- Parse log files in both bash and Python
- Use regex the same way in grep, sed, awk, and Python's re module
- Know when to reach for a pipeline vs. a script
- Port a solution from one language to the other
Part 1: The Mission¶
You're on call. A web server is misbehaving. You have a 50,000-line access log and you need answers fast:
- How many requests returned 5xx errors?
- Which IP addresses are hammering the server?
- What URLs are failing?
The log looks like this:
10.0.0.42 - - [22/Mar/2026:03:14:22 +0000] "GET /api/users HTTP/1.1" 500 1024
10.0.0.17 - - [22/Mar/2026:03:14:23 +0000] "POST /api/orders HTTP/1.1" 201 512
10.0.0.42 - - [22/Mar/2026:03:14:24 +0000] "GET /health HTTP/1.1" 200 15
10.0.0.99 - - [22/Mar/2026:03:14:25 +0000] "GET /api/users HTTP/1.1" 503 0
Let's solve each question two ways — bash pipeline first, then Python — and compare.
Part 2: Count the 5xx Errors¶
The Bash Way¶
One command. Done. Let's break it down:
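The one-liner itself, assembled from the pieces in the table that follows (a reconstruction; a throwaway copy of the Part 1 sample log is created inline so the snippet runs standalone):

```shell
# Recreate the four sample lines from Part 1
printf '%s\n' \
  '10.0.0.42 - - [22/Mar/2026:03:14:22 +0000] "GET /api/users HTTP/1.1" 500 1024' \
  '10.0.0.17 - - [22/Mar/2026:03:14:23 +0000] "POST /api/orders HTTP/1.1" 201 512' \
  '10.0.0.42 - - [22/Mar/2026:03:14:24 +0000] "GET /health HTTP/1.1" 200 15' \
  '10.0.0.99 - - [22/Mar/2026:03:14:25 +0000] "GET /api/users HTTP/1.1" 503 0' > access.log

# Count lines whose status field is 5xx
grep -cE '" [5][0-9]{2} ' access.log    # prints 2 for the sample
```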
| Piece | What it does |
|---|---|
| `grep -c` | Count matching lines instead of printing them |
| `-E` | Extended regex (so `{2}` works without escaping) |
| `'" [5][0-9]{2} '` | A quote, space, 5, two more digits, space — matches the status code field |
Regex Sidebar: Character Classes
`[0-9]` matches any single digit. `{2}` means "exactly two of the previous thing." Together, `[5][0-9]{2}` matches 500, 501, 502, ... 599. You could also write `5[0-9][0-9]` without the quantifier — same result, more explicit.
The Python Way¶
```python
import re

count = 0
pattern = re.compile(r'" (5\d{2}) ')
with open("access.log") as f:
    for line in f:
        if pattern.search(line):
            count += 1
print(count)
```
More lines, but look what we get: a compiled regex (faster for 50k lines), line-by-line reading (constant memory), and a variable we can use later.
Side-by-Side¶
| | Bash | Python |
|---|---|---|
| Lines of code | 1 | 8 |
| Memory | Stream (low) | Stream (low) |
| Regex dialect | ERE (POSIX) | PCRE-like |
| Reusability | Pipe it further | Build on it |
| When to pick it | Quick answer at the terminal | Part of a larger script |
Trivia:
`\d` in Python regex means "any digit" — it's shorthand for `[0-9]`. But `\d` does NOT work in basic `grep` or `sed`. You need `grep -P` (PCRE mode) or stick with `[0-9]`. This is the #1 regex portability gotcha.
Part 3: Find the Top Offending IPs¶
The Bash Way¶
This is the classic Unix pipeline — each tool does one job:
- `grep` → filter to 5xx lines only
- `awk` → extract first column (IP address)
- `sort` → group identical lines together
- `uniq -c` → count consecutive duplicates
- `sort -rn` → sort by count, descending
- `head -10` → top 10
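Chained together, the steps above become (a reconstruction from the step list; assumes the `access.log` from Part 1):

```shell
# Top 10 IPs behind 5xx responses
grep -E '" 5[0-9]{2} ' access.log \
    | awk '{print $1}' \
    | sort | uniq -c | sort -rn | head -10
```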
CLI Tool Trivia: Why sort before uniq?
`uniq` only collapses adjacent duplicate lines. If the same IP appears on lines 1, 5, and 900, `uniq` sees three separate entries. `sort` groups them together first. This is the most common pipeline mistake beginners make.
The Python Way¶
```python
import re
from collections import Counter

pattern = re.compile(r'" (5\d{2}) ')
ips = Counter()
with open("access.log") as f:
    for line in f:
        if pattern.search(line):
            ip = line.split()[0]
            ips[ip] += 1

for ip, count in ips.most_common(10):
    print(f"{count:>6} {ip}")
```
`Counter` does what `sort | uniq -c | sort -rn` does — but in memory, in one pass, with no need to pre-sort.
Side-by-Side¶
| | Bash | Python |
|---|---|---|
| Approach | Stream: filter → extract → sort → count | In-memory: scan → accumulate → rank |
| Memory | O(matching lines) at the `sort` stage (spills to disk for huge inputs) | O(unique IPs) in the `Counter` |
| Speed | Spawns 6 processes, but they run in parallel via pipes | Single process, pure Python loops |
| Readability | Dense but idiomatic | Verbose but self-documenting |
When Bash Wins: You're already in the terminal, the data fits the pattern of "filter, extract, aggregate," and you need the answer in 10 seconds.
When Python Wins: You need to do something after counting — send an alert, update a database, generate a report, or the logic has branches and error handling.
Part 4: Extract the Failing URLs¶
The Bash Way¶
```shell
grep -E '" 5[0-9]{2} ' access.log \
    | grep -oE '"[A-Z]+ [^ ]+ HTTP' \
    | sort | uniq -c | sort -rn | head -10
```
Getting ugly, isn't it? Here's a cleaner version with awk:
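A sketch of that awk version, reconstructed from the field description below (assumes the `access.log` from Part 1):

```shell
# $9 is the status code, $7 the URL; keep only 5xx lines, then count URLs
awk '$9 ~ /^5/ {print $7}' access.log | sort | uniq -c | sort -rn | head -10
```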
awk knows that field `$9` is the status code and `$7` is the URL (in the standard common log format shown above). The `~` operator does regex matching.
Awk Trivia:
`awk` is a full programming language from 1977. It has variables, arrays, loops, functions, and printf. Most people only use `'{print $2}'` and never discover the rest. The name comes from its creators: Aho, Weinberger, and Kernighan (yes, the K in K&R C).
The Python Way¶
```python
import re
from collections import Counter

log_pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)
urls = Counter()
with open("access.log") as f:
    for line in f:
        m = log_pattern.match(line)
        if m and m.group("status").startswith("5"):
            urls[m.group("url")] += 1

for url, count in urls.most_common(10):
    print(f"{count:>6} {url}")
```
Named capture groups (`(?P<name>...)`) make the code self-documenting. Once you have the pattern, every field is available by name — no counting columns.
Side-by-Side¶
| | Bash (awk) | Python |
|---|---|---|
| Parsing | Positional fields ($7, $9) — fragile if format changes | Named groups — resilient to format changes |
| Regex | awk's built-in ERE | Python's re (PCRE-like) |
| Extensibility | Hard to add "also group by method" | Add m.group("method") to the key |
Part 5: The Regex Rosetta Stone¶
The same patterns across four tools. Clip this and keep it:
| What you want | grep (ERE) | sed (ERE) | awk | Python re |
|---|---|---|---|---|
| Match a digit | `[0-9]` | `[0-9]` | `[0-9]` | `\d` or `[0-9]` |
| One or more digits | `[0-9]+` | `[0-9]+` | `[0-9]+` | `\d+` |
| Word boundary | N/A (use `grep -w` or `grep -P`) | N/A | N/A | `\b` |
| Non-greedy match | `grep -P '.*?'` | N/A | N/A | `.*?` |
| Named group | `grep -P '(?P<n>...)'` | N/A | N/A | `(?P<name>...)` |
| Backreference | `\1` (BRE) or `grep -P '\1'` | `\1` | N/A | `\1` |
| IP address (simple) | `[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+` | same | same | `\d+\.\d+\.\d+\.\d+` |
Key Rule:
`\d`, `\w`, `\s` are PCRE shortcuts. They work in Python and `grep -P`. They do NOT work in plain `grep -E`, `sed`, or `awk`. Use `[0-9]`, `[a-zA-Z0-9_]`, `[[:space:]]` for portable regex.
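To see the rule in practice (hypothetical sample string), the POSIX routes work in any grep:

```shell
echo 'user42 logged in at 09:15' | grep -oE '[0-9]+'        # portable: 42, 09, 15
echo 'user42 logged in at 09:15' | grep -oE '[[:digit:]]+'  # same result, POSIX class
```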
Flashcard Check¶
Test yourself. Cover the answers and try to recall.
Q1: What does grep -c do?
Counts matching lines instead of printing them. (It counts lines, not individual matches — for a total match count, use `grep -o | wc -l`.)
Q2: Python re.compile(r'" (5\d{2}) ') — what does the r prefix do?
Makes it a raw string, so backslashes reach the regex engine untouched. Without `r`, Python first interprets backslash escapes itself: `"\b"` becomes a backspace character, and unrecognized escapes like `"\d"` trigger a warning in modern Python.
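A quick way to see the difference, using `\b` (which, unlike `\d`, IS a recognized string escape):

```python
# "\b" collapses to a single backspace character before the regex engine
# ever sees it; r"\b" keeps both characters, which is what a word boundary needs
print(len("\b"), len(r"\b"))    # prints: 1 2
```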
Q3: Why does sort | uniq -c require the sort first?
`uniq` only collapses adjacent duplicates. Without sorting, non-adjacent identical lines are counted separately.
Q4: awk '{print $1}' — what is $1?
The first whitespace-delimited field of the current line. `$0` is the whole line. `$NF` is the last field. `NF` is the number of fields.
Q5: Bash set -e equivalent in Python?
There is no direct equivalent. Use `try`/`except` blocks, or pass `check=True` to `subprocess.run()` to make failed commands raise exceptions.
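The subprocess half of that answer, as a minimal sketch (uses the standard `false` command, which always exits non-zero):

```python
import subprocess

# check=True turns a non-zero exit status into an exception,
# roughly what `set -e` gives you in bash
try:
    subprocess.run(["false"], check=True)
except subprocess.CalledProcessError as e:
    print("command failed with exit code", e.returncode)   # exit code 1
```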
Q6: What regex matches "color" and "colour"?
`colou?r` — the `?` makes the `u` optional (zero or one occurrence).
Q7: \b in Python regex matches what?
A word boundary — the position between a word character (`\w`) and a non-word character. `\bcat\b` matches "cat" but not "concatenate."
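A minimal check of that behavior:

```python
import re

print(bool(re.search(r'\bcat\b', 'the cat sat')))   # True
print(bool(re.search(r'\bcat\b', 'concatenate')))   # False
```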
Q8: When should you stop using bash and switch to Python?
When you need complex data structures, real error handling, the script exceeds ~100 lines, or you need to test it. Bash is glue, not an application framework.
Exercises¶
Exercise 1: The Quick Count (bash)¶
Using the access log format from Part 1, write a one-liner that counts how many unique IPs made GET requests:
Hint 1
Filter for GET first, then extract the IP, then count unique values.

Hint 2
`grep 'GET' | awk '{print $1}' | sort -u | wc -l`

Solution
`grep '"GET ' access.log | awk '{print $1}' | sort -u | wc -l`
Note `"GET ` with a quote — otherwise you'd match lines containing "GET" in the URL path.

Exercise 2: Port It (bash → python)¶
Take this bash pipeline and rewrite it as a Python script:
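A plausible reconstruction of the pipeline, inferred from the solution below (top 5 IP/status pairs for 4xx and 5xx responses; assumes the `access.log` from Part 1):

```shell
# Count (IP, status) pairs for client and server errors
grep -E '" [45][0-9]{2} ' access.log \
    | awk '{print $1, $9}' \
    | sort | uniq -c | sort -rn | head -5
```

`uniq -c` pads its count column slightly differently from the Python f-string, so "identical" output here means the same pairs and counts, not byte-for-byte formatting.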
Your Python version should produce identical output.
Hint 1
Use `re` to match 4xx and 5xx status codes. Use `collections.Counter` for counting.

Hint 2
The key for the Counter should be a tuple: `(ip, status_code)`.

Solution
```python
import re
from collections import Counter

pattern = re.compile(r'" ([45]\d{2}) ')
pairs = Counter()
with open("access.log") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            ip = line.split()[0]
            status = m.group(1)
            pairs[(ip, status)] += 1

for (ip, status), count in pairs.most_common(5):
    print(f"{count:>6} {ip} {status}")
```
Exercise 3: Port It (python → bash)¶
Rewrite this Python script as a bash one-liner:
```python
import re
from collections import Counter

with open("access.log") as f:
    hours = Counter(
        re.search(r'\[.*/(.../\d{4}:\d{2})', line).group(1)
        for line in f
        if re.search(r'" 5\d{2} ', line)
    )
for hour, count in sorted(hours.items()):
    print(f"{hour}: {count}")
```
(It counts 5xx errors per hour.)
Hint 1
Extract the date/hour from the bracket field with `grep -oE` or awk.

Hint 2
awk can do the filtering AND the field extraction in one pass.

Solution
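One reconstruction of the solution (assumes the Part 1 log format, where `$4` is `[22/Mar/2026:03:14:22`, so characters 5–15 are the `Mon/YYYY:HH` hour bucket):

```shell
# One pass: filter on the status field ($9), slice the hour out of $4
awk '$9 ~ /^5/ {print substr($4, 5, 11)}' access.log | sort | uniq -c
```

Output formatting differs from the Python version (`count hour` rather than `hour: count`), but the buckets and counts match.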
Exercise 4: Regex Debugging¶
This regex is supposed to match IP addresses but has a bug: `\d+.\d+.\d+.\d+`. What's wrong?
Hint
The dot is special in regex.

Answer
The `.` matches ANY character, not just a literal dot. `1234X5678Y9012Z3456` would match. Fix: escape the dots → `\d+\.\d+\.\d+\.\d+`. And remember — `\d` only works in PCRE. For grep/sed/awk portability, use: `[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+`

Exercise 5: The Decision (think, don't code)¶
For each task, decide: bash pipeline or Python script? Justify your choice.
- Find the 10 largest `.log` files under `/var/log`
- Parse a CSV, validate email formats, and insert valid rows into a database
- Tail a log file and send a Slack alert when "OOM" appears
- Rename 200 files from `IMG_NNNN.jpg` to `2026-03-22_NNNN.jpg`
- Generate a weekly PDF report from three API endpoints
Answers
1. **Bash.** `find /var/log -name '*.log' -type f -printf '%s %p\n' | sort -rn | head -10` — one-liner, no data structures needed.
2. **Python.** CSV parsing, regex validation, database connections, error handling on each row — too much state for bash.
3. **Either, but Python is safer.** Bash: `tail -f log | grep --line-buffered OOM | while read; do curl ...; done`. Python: easier retry logic, rate limiting, structured Slack payload.
4. **Bash.** `for f in IMG_*.jpg; do mv "$f" "2026-03-22_${f#IMG_}"; done` — or even `rename` if installed. Python works too but it's overkill.
5. **Python.** Multiple HTTP requests, JSON parsing, data aggregation, PDF generation — bash would be miserable.

Cheat Sheet: Bash ↔ Python Quick Reference¶
| Task | Bash | Python |
|---|---|---|
| Read a file | `while IFS= read -r line; do ...; done < file` | `with open(f) as fh: for line in fh:` |
| Regex match | `[[ "$s" =~ pattern ]]` then `${BASH_REMATCH[1]}` | `m = re.search(pattern, s)` then `m.group(1)` |
| Replace text | `sed 's/old/new/g' file` | `re.sub(r'old', 'new', text)` |
| Split a string | `IFS=',' read -ra arr <<< "$s"` | `s.split(',')` |
| Associative array | `declare -A map; map[key]=val` | `d = {}; d[key] = val` |
| Sort + unique count | `sort \| uniq -c \| sort -rn` | `collections.Counter(items).most_common()` |
| Run a command | Just type it | `subprocess.run(["cmd", "arg"], check=True)` |
| Error handling | `set -euo pipefail` | `try: ... except Exception as e:` |
| Temp file | `tmp=$(mktemp)` | `import tempfile; tmp = tempfile.NamedTemporaryFile()` |
| JSON parsing | `jq '.key' file.json` | `import json; data = json.load(f)` |
| HTTP request | `curl -s URL` | `import requests; r = requests.get(URL)` |
| Exit code | `$?` | `sys.exit(code)` or subprocess return code |
Key Takeaways¶
- **Bash pipelines are fast for exploration.** When you're poking at a log file and need an answer in 10 seconds, nothing beats `grep | awk | sort | uniq -c`.
- **Python is better for production.** Once you need error handling, tests, or anything beyond "filter → extract → count," switch to Python.
- **Regex has dialects.** `\d` works in Python and `grep -P`, but NOT in `sed`, `awk`, or plain `grep -E`. Use `[0-9]` when portability matters.
- **The tools compose.** The best answer is often a hybrid: bash to explore and prototype, Python to productionize. Or bash calling Python for the hard parts.
- **Know both, choose deliberately.** "I know bash so I'll bash it" and "I know Python so I'll Python it" are both wrong. Pick the tool that fits the task.
What's Next¶
- Lesson 2: File Operations — `find`/`xargs` vs `pathlib`/`os.walk`, bulk rename, permission audits
- Lesson 3: Process Management — `&`/`wait`/`trap` vs `subprocess`/`asyncio`/`signal`
- Lesson 4: Data Wrangling — `jq`/`cut`/`sort` vs `json`/`csv`/`pandas` one-liners
- Lesson 5: Error Handling — `set -euo pipefail` vs `try`/`except`, retry patterns in both