
Scripting Rosetta — Lesson 1: Text Processing

Bundle: Bash + Python + CLI Tools + Regex
Level: L1–L2 (Foundations → Operations)
Time: 45–60 minutes
Prerequisites: Basic terminal comfort, can edit files and run commands


What You'll Learn

By the end of this lesson you'll be able to:

  • Parse log files in both bash and Python
  • Use regex the same way in grep, sed, awk, and Python's re module
  • Know when to reach for a pipeline vs. a script
  • Port a solution from one language to the other


Part 1: The Mission

You're on call. A web server is misbehaving. You have a 50,000-line access log and you need answers fast:

  1. How many requests returned 5xx errors?
  2. Which IP addresses are hammering the server?
  3. What URLs are failing?

The log looks like this:

10.0.0.42 - - [22/Mar/2026:03:14:22 +0000] "GET /api/users HTTP/1.1" 500 1024
10.0.0.17 - - [22/Mar/2026:03:14:23 +0000] "POST /api/orders HTTP/1.1" 201 512
10.0.0.42 - - [22/Mar/2026:03:14:24 +0000] "GET /health HTTP/1.1" 200 15
10.0.0.99 - - [22/Mar/2026:03:14:25 +0000] "GET /api/users HTTP/1.1" 503 0

Let's solve each question two ways — bash pipeline first, then Python — and compare.


Part 2: Count the 5xx Errors

The Bash Way

grep -cE '" [5][0-9]{2} ' access.log

One command. Done. Let's break it down:

Piece             What it does
grep -c           Count matching lines instead of printing them
-E                Extended regex (so {2} works without escaping)
'" [5][0-9]{2} '  A quote, space, a 5, two more digits, space — matches the status code field

Regex Sidebar: Character Classes

[0-9] matches any single digit. {2} means "exactly two of the previous thing." Together, [5][0-9]{2} matches 500, 501, 502, ... 599.

You could also write 5[0-9][0-9] without the quantifier — same result, more explicit.
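A quick sanity check in Python's re, which accepts the same character-class syntax, confirms the two spellings are equivalent (the sample string here is made up for illustration):

```python
import re

text = "codes: 200 503 599 499"

# {2} quantifier vs. explicit repetition: same matches either way
print(re.findall(r"5[0-9]{2}", text))    # ['503', '599']
print(re.findall(r"5[0-9][0-9]", text))  # ['503', '599']
```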

The Python Way

import re

count = 0
pattern = re.compile(r'" (5\d{2}) ')

with open("access.log") as f:
    for line in f:
        if pattern.search(line):
            count += 1

print(count)

More lines, but look what we get: a compiled regex (faster for 50k lines), line-by-line reading (constant memory), and a variable we can use later.

Side-by-Side

                 Bash                          Python
Lines of code    1                             8
Memory           Stream (low)                  Stream (low)
Regex dialect    ERE (POSIX)                   PCRE-like
Reusability      Pipe it further               Build on it
When to pick it  Quick answer at the terminal  Part of a larger script

Trivia: \d in Python regex means "any digit" — it's shorthand for [0-9]. But \d does NOT work in basic grep or sed. You need grep -P (PCRE mode) or stick with [0-9]. This is the #1 regex portability gotcha.
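You can see the equivalence, plus one subtlety, directly in Python: on str input, `\d` also matches non-ASCII Unicode digits, which `[0-9]` never does. Pass re.ASCII if you want strict `[0-9]` behavior:

```python
import re

s = "status 503, size 0"
print(re.findall(r"\d+", s))     # ['503', '0']
print(re.findall(r"[0-9]+", s))  # ['503', '0']

# Subtlety: on str patterns, \d also matches Unicode digits such as
# '٣' (Arabic-Indic digit three) unless you pass re.ASCII.
print(re.findall(r"\d", "٣"))            # ['٣']
print(re.findall(r"\d", "٣", re.ASCII))  # []
```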


Part 3: Find the Top Offending IPs

The Bash Way

grep -E '" 5[0-9]{2} ' access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -10

This is the classic Unix pipeline — each tool does one job:

grep      filter to 5xx lines only
awk       extract first column (IP address)
sort      group identical lines together
uniq -c   count consecutive duplicates
sort -rn  sort by count, descending
head -10  top 10

CLI Tool Trivia: Why sort before uniq?

uniq only collapses adjacent duplicate lines. If the same IP appears on lines 1, 5, and 900, uniq sees three separate entries. sort groups them together first. This is the most common pipeline mistake beginners make.
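Python has a direct analog: itertools.groupby only merges adjacent runs, exactly like uniq, so it needs sorted input too. A minimal demonstration with a made-up IP list:

```python
from itertools import groupby

ips = ["10.0.0.42", "10.0.0.17", "10.0.0.42"]

# Unsorted: the two 10.0.0.42 entries are not adjacent, so they stay
# separate. Three groups, just like uniq without sort.
print([(ip, len(list(g))) for ip, g in groupby(ips)])

# Sorted first: identical IPs become adjacent. Two groups.
print([(ip, len(list(g))) for ip, g in groupby(sorted(ips))])
```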

The Python Way

import re
from collections import Counter

pattern = re.compile(r'" (5\d{2}) ')
ips = Counter()

with open("access.log") as f:
    for line in f:
        if pattern.search(line):
            ip = line.split()[0]
            ips[ip] += 1

for ip, count in ips.most_common(10):
    print(f"{count:>6}  {ip}")

Counter does what sort | uniq -c | sort -rn does — but in memory, in one pass, with no need to pre-sort.
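The correspondence in miniature (a toy list, not real log data):

```python
from collections import Counter

# No pre-sorting needed: Counter accumulates counts in one pass,
# and most_common() plays the role of sort -rn | head.
hits = Counter(["10.0.0.42", "10.0.0.17", "10.0.0.42", "10.0.0.99", "10.0.0.42"])
print(hits.most_common(2))  # [('10.0.0.42', 3), ('10.0.0.17', 1)]
```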

Side-by-Side

             Bash                                                 Python
Approach     Stream: filter → extract → sort → count              In-memory: scan → accumulate → rank
Memory       O(unique IPs) at the sort stage                      O(unique IPs) in the Counter
Speed        Spawns six processes that run in parallel via pipes  Single process, pure Python loops
Readability  Dense but idiomatic                                  Verbose but self-documenting

When Bash Wins: You're already in the terminal, the data fits the pattern of "filter, extract, aggregate," and you need the answer in 10 seconds.

When Python Wins: You need to do something after counting — send an alert, update a database, generate a report, or the logic has branches and error handling.


Part 4: Extract the Failing URLs

The Bash Way

grep -E '" 5[0-9]{2} ' access.log | grep -oE '"[A-Z]+ [^ ]+ HTTP' | sort | uniq -c | sort -rn | head -10

Getting ugly, isn't it? Here's a cleaner version with awk:

awk '$9 ~ /^5/ {print $7}' access.log | sort | uniq -c | sort -rn | head -10

awk knows that field $9 is the status code and $7 is the URL (in standard combined log format). The ~ operator does regex matching.

Awk Trivia: awk is a full programming language from 1977. It has variables, arrays, loops, functions, and printf. Most people only use '{print $2}' and never discover the rest. The name comes from its creators: Aho, Weinberger, and Kernighan (yes, the K in K&R C).

The Python Way

import re
from collections import Counter

log_pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)

urls = Counter()

with open("access.log") as f:
    for line in f:
        m = log_pattern.match(line)
        if m and m.group("status").startswith("5"):
            urls[m.group("url")] += 1

for url, count in urls.most_common(10):
    print(f"{count:>6}  {url}")

Named capture groups ((?P<name>...)) make the code self-documenting. Once you have the pattern, every field is available by name — no counting columns.
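groupdict() goes one step further and hands back every named field at once. Here is the same pattern applied to a single sample line:

```python
import re

log_pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)

line = '10.0.0.42 - - [22/Mar/2026:03:14:22 +0000] "GET /api/users HTTP/1.1" 500 1024'
m = log_pattern.match(line)
print(m.group("url"))  # /api/users
print(m.groupdict())   # every field by name, as a dict
```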

Side-by-Side

               Bash (awk)                                             Python
Parsing        Positional fields ($7, $9) — fragile if format changes  Named groups — resilient to format changes
Regex          awk's built-in ERE                                     Python's re (PCRE-like)
Extensibility  Hard to add "also group by method"                     Add m.group("method") to the key

Part 5: The Regex Rosetta Stone

The same patterns, three dialects. Clip this and keep it:

What you want        grep (ERE)                        sed (ERE)  awk     Python re
Match a digit        [0-9]                             [0-9]      [0-9]   \d or [0-9]
One or more digits   [0-9]+                            [0-9]+     [0-9]+  \d+
Word boundary        grep -w (whole word) or grep -P '\b'  N/A    N/A     \b
Non-greedy match     grep -P '.*?'                     N/A        N/A     .*?
Named group          grep -P '(?P<n>...)'              N/A        N/A     (?P<name>...)
Backreference        grep -P '(.)\1'                   \1         N/A     \1
IP address (simple)  [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+    same       same    \d+\.\d+\.\d+\.\d+

Key Rule: \d, \w, \s are PCRE shortcuts. They work in Python and grep -P. They do NOT work in plain grep -E, sed, or awk. Use [0-9], [a-zA-Z0-9_], [[:space:]] for portable regex.


Flashcard Check

Test yourself. Cover the answers and try to recall.

Q1: What does grep -c do?

Counts matching lines instead of printing them. (It does NOT count total matches per line — grep -o | wc -l does that.)

Q2: Python re.compile(r'" (5\d{2}) ') — what does the r prefix do?

Makes it a raw string so backslashes are literal. Without r, \d would be interpreted as an escape sequence by Python before regex sees it.

Q3: Why does sort | uniq -c require the sort first?

uniq only collapses adjacent duplicates. Without sorting, non-adjacent identical lines are counted separately.

Q4: awk '{print $1}' — what is $1?

The first whitespace-delimited field of the current line. $0 is the whole line. $NF is the last field. NF is the number of fields.

Q5: Bash set -e equivalent in Python?

There is no direct equivalent. Use try/except blocks, or pass check=True to subprocess.run() to make failed commands raise exceptions.
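For example, check=True turns a nonzero exit status into a catchable exception (using the standard false utility, which always exits nonzero):

```python
import subprocess

# "false" always exits with a nonzero status.
try:
    subprocess.run(["false"], check=True)
except subprocess.CalledProcessError as e:
    print("command failed with exit code", e.returncode)
```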

Q6: What regex matches "color" and "colour"?

colou?r — the ? makes the u optional (zero or one occurrence).

Q7: \b in Python regex matches what?

A word boundary — the position between a word character (\w) and a non-word character. \bcat\b matches "cat" but not "concatenate."
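A quick check:

```python
import re

# \b fires only where a word character meets a non-word character,
# so "cat" inside other words never matches.
print(re.findall(r"\bcat\b", "cat concatenate bobcat catalog"))  # ['cat']
```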

Q8: When should you stop using bash and switch to Python?

When you need complex data structures, real error handling, the script exceeds ~100 lines, or you need to test it. Bash is glue, not an application framework.


Exercises

Exercise 1: The Quick Count (bash)

Given this log line format, write a one-liner that counts how many unique IPs made GET requests:

10.0.0.42 - - [22/Mar/2026:03:14:22 +0000] "GET /api/users HTTP/1.1" 200 1024
Hint 1: Filter for GET first, then extract the IP, then count unique values.
Hint 2: grep 'GET' | awk '{print $1}' | sort -u | wc -l
Solution:
grep '"GET ' access.log | awk '{print $1}' | sort -u | wc -l
Note the `"GET ` pattern with the leading quote — without it you'd also match lines containing "GET" in the URL path.
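To keep the rosetta habit going, here is the same count in Python, shown against an inline sample (made up for illustration — swap the list for open("access.log") on real data):

```python
sample = [
    '10.0.0.42 - - [22/Mar/2026:03:14:22 +0000] "GET /api/users HTTP/1.1" 200 1024',
    '10.0.0.17 - - [22/Mar/2026:03:14:23 +0000] "POST /api/orders HTTP/1.1" 201 512',
    '10.0.0.42 - - [22/Mar/2026:03:14:24 +0000] "GET /health HTTP/1.1" 200 15',
]

# The set comprehension is the sort -u; len() is the wc -l.
unique_get_ips = {line.split()[0] for line in sample if '"GET ' in line}
print(len(unique_get_ips))  # 1 -- only 10.0.0.42 made GET requests
```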

Exercise 2: Port It (bash → python)

Take this bash pipeline and rewrite it as a Python script:

grep -E '" [45][0-9]{2} ' access.log | awk '{print $1, $9}' | sort | uniq -c | sort -rn | head -5

Your Python version should produce identical output.

Hint 1: Use `re` to match 4xx and 5xx status codes. Use `collections.Counter` for counting.
Hint 2: The key for the Counter should be a tuple: `(ip, status_code)`.
Solution:
import re
from collections import Counter

pattern = re.compile(r'" ([45]\d{2}) ')
pairs = Counter()

with open("access.log") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            ip = line.split()[0]
            status = m.group(1)
            pairs[(ip, status)] += 1

for (ip, status), count in pairs.most_common(5):
    print(f"{count:>6}  {ip} {status}")

Exercise 3: Port It (python → bash)

Rewrite this Python script as a bash one-liner:

import re
from collections import Counter

with open("access.log") as f:
    hours = Counter(
        re.search(r'\[([^:]+:\d{2})', line).group(1)
        for line in f
        if re.search(r'" 5\d{2} ', line)
    )

for hour, count in sorted(hours.items()):
    print(f"{hour}: {count}")

(It counts 5xx errors per hour.)

Hint 1: Extract the date/hour from the bracket field with grep -oE or awk.
Hint 2: awk can do the filtering AND the field extraction in one pass.
Solution:
awk '$9 ~ /^5/ {print substr($4, 2, 14)}' access.log | sort | uniq -c
Or more readably:
grep -E '" 5[0-9]{2} ' access.log \
  | awk -F'[\\[:]' '{print $2":"$3}' \
  | sort | uniq -c

Exercise 4: Regex Debugging

This regex is supposed to match IP addresses but has a bug. What's wrong?

\d+.\d+.\d+.\d+
Hint: The dot is special in regex.
Answer: The `.` matches ANY character, not just a literal dot, so `1234X5678Y9012Z3456` would match. Fix: escape the dots → `\d+\.\d+\.\d+\.\d+`. And remember — `\d` only works in PCRE. For grep/sed/awk portability, use `[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+`.
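A short Python check makes the bug visible:

```python
import re

bad = re.compile(r"\d+.\d+.\d+.\d+")      # unescaped dots match anything
good = re.compile(r"\d+\.\d+\.\d+\.\d+")  # literal dots only

print(bool(bad.search("1234X5678Y9012Z3456")))   # True -- the bug
print(bool(good.search("1234X5678Y9012Z3456")))  # False
print(bool(good.search("10.0.0.42")))            # True
```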

Exercise 5: The Decision (think, don't code)

For each task, decide: bash pipeline or Python script? Justify your choice.

  1. Find the 10 largest .log files under /var/log
  2. Parse a CSV, validate email formats, and insert valid rows into a database
  3. Tail a log file and send a Slack alert when "OOM" appears
  4. Rename 200 files from IMG_NNNN.jpg to 2026-03-22_NNNN.jpg
  5. Generate a weekly PDF report from three API endpoints
Answers:

  1. Bash. find /var/log -name '*.log' -type f -printf '%s %p\n' | sort -rn | head -10 — one-liner, no data structures needed.
  2. Python. CSV parsing, regex validation, database connections, error handling on each row — too much state for bash.
  3. Either, but Python is safer. Bash: tail -f log | grep --line-buffered OOM | while read; do curl ...; done. Python: easier retry logic, rate limiting, structured Slack payload.
  4. Bash. for f in IMG_*.jpg; do mv "$f" "2026-03-22_${f#IMG_}"; done — or even rename if installed. Python works too, but it's overkill.
  5. Python. Multiple HTTP requests, JSON parsing, data aggregation, PDF generation — bash would be miserable.

Cheat Sheet: Bash ↔ Python Quick Reference

Task                 Bash                                           Python
Read a file          while IFS= read -r line; do ...; done < file   with open(f) as fh: for line in fh: ...
Regex match          [[ "$s" =~ pattern ]] then ${BASH_REMATCH[1]}  m = re.search(pattern, s) then m.group(1)
Replace text         sed 's/old/new/g' file                         re.sub(r'old', 'new', text)
Split a string       IFS=',' read -ra arr <<< "$s"                  s.split(',')
Associative array    declare -A map; map[key]=val                   d = {}; d[key] = val
Sort + unique count  sort | uniq -c | sort -rn                      collections.Counter(items).most_common()
Run a command        Just type it                                   subprocess.run(["cmd", "arg"], check=True)
Error handling       set -euo pipefail                              try: ... except Exception as e: ...
Temp file            tmp=$(mktemp)                                  import tempfile; tmp = tempfile.NamedTemporaryFile()
JSON parsing         jq '.key' file.json                            import json; data = json.load(f)
HTTP request         curl -s URL                                    import requests; r = requests.get(URL)
Exit code            $? (read), exit N (set)                        sys.exit(code) (set), proc.returncode (read)

Key Takeaways

  1. Bash pipelines are fast for exploration. When you're poking at a log file and need an answer in 10 seconds, nothing beats grep | awk | sort | uniq -c.

  2. Python is better for production. Once you need error handling, tests, or anything beyond "filter → extract → count," switch to Python.

  3. Regex has dialects. \d works in Python and grep -P, but NOT in sed, awk, or plain grep -E. Use [0-9] when portability matters.

  4. The tools compose. The best answer is often a hybrid: bash to explore and prototype, Python to productionize. Or bash calling Python for the hard parts.

  5. Know both, choose deliberately. "I know bash so I'll bash it" and "I know Python so I'll Python it" are both wrong. Pick the tool that fits the task.


What's Next

  • Lesson 2: File Operations — find/xargs vs pathlib/os.walk, bulk rename, permission audits
  • Lesson 3: Process Management — &/wait/trap vs subprocess/asyncio/signal
  • Lesson 4: Data Wrangling — jq/cut/sort vs json/csv/pandas one-liners
  • Lesson 5: Error Handling — set -euo pipefail vs try/except, retry patterns in both