
Regex & Text Wrangling - Street-Level Ops

Quick Diagnosis Commands

When you need to extract data from logs or configs right now:

# Find all unique error messages in a log (GNU sed: the I flag makes the strip case-insensitive, matching grep -i)
grep -i 'error' /var/log/app.log | sed 's/^.*error //I' | sort -u

# Count requests per HTTP status code
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn

# Show the last 100 matching lines (with their original timestamps) from the recent tail of the log
tail -n 10000 app.log | grep -E 'WARN|ERROR|FATAL' | tail -n 100

# Find which config files contain a specific setting
grep -rl 'max_connections' /etc/

# Extract and count unique IP addresses from logs
awk '{print $1}' access.log | sort -u | wc -l

Pattern: Log Parsing One-Liners

Apache/Nginx Access Logs

# Top 20 IPs by request count
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -20

# Requests per minute (for rate analysis)
awk '{print $4}' access.log | cut -d: -f1-3 | uniq -c

# All 5xx errors with full URL
awk '$9 ~ /^5/ {print $9, $7}' access.log

# Bandwidth per URL (sum bytes transferred)
awk '{bytes[$7] += $10} END {for (url in bytes) printf "%10d %s\n", bytes[url], url}' access.log | sort -rn | head -20

# Slow requests (response time > 2s, if logged as last field)
awk '$NF > 2.0 {print $NF, $7}' access.log | sort -rn | head -20

# Extract query strings
grep -oP '\?[^ "]*' access.log | sort | uniq -c | sort -rn | head -20

Syslog / journald

# Failed SSH login attempts with usernames
grep 'Failed password' /var/log/auth.log | awk '{print $(NF-5)}' | sort | uniq -c | sort -rn

# Sudo commands by user
grep 'sudo:' /var/log/auth.log | sed -n 's/.*sudo:\s*\(\S*\).*/\1/p' | sort | uniq -c

# Service restarts in the last hour
journalctl --since "1 hour ago" | grep -iE 'started|stopped|restart'

Pattern: CSV Wrangling Without a Spreadsheet

# Print specific columns (1st and 3rd) from CSV
awk -F, '{print $1, $3}' data.csv

# Filter rows where column 3 > 100
awk -F, '$3 > 100' data.csv

# Sum column 4
awk -F, '{sum += $4} END {print sum}' data.csv

# Convert CSV to TSV
sed 's/,/\t/g' data.csv > data.tsv

# Add header to output
awk -F, 'NR==1 || $3 > 100' data.csv

# Handle quoted CSV fields (use proper tool for complex CSV)
# For simple cases without embedded commas:
awk -F, '{gsub(/"/, ""); print $2}' data.csv

# Join two CSVs on first column (files must be sorted)
join -t, file1.csv file2.csv
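
join silently produces garbage if its inputs aren't sorted on the join key. With bash/zsh process substitution you can sort on the fly without temp files — a sketch, with hypothetical file names:

```shell
# Sort both inputs on column 1 before joining (bash/zsh process substitution)
join -t, <(sort -t, -k1,1 people.csv) <(sort -t, -k1,1 scores.csv)
```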

# Pivot: count occurrences of values in column 2
awk -F, '{count[$2]++} END {for (k in count) print k "," count[k]}' data.csv

Pattern: Config File Surgery

Gotcha: sed -i on a symlink replaces the symlink with a regular file. The original file (what the symlink pointed to) is unchanged, and any other symlinks to it are now orphaned. Use sed -i.bak and verify, or dereference the link first.
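
A minimal sketch of the dereference-first approach (assumes GNU readlink; the config path is illustrative):

```shell
# Resolve the symlink to the real file, then edit that in place with a backup
target=$(readlink -f /etc/myapp/current.conf)   # hypothetical symlinked config
sed -i.bak 's/old_value/new_value/g' "$target"  # the symlink itself is untouched
```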

# Uncomment a line
sed -i 's/^#\(max_connections = 100\)/\1/' postgresql.conf

# Comment out a line
sed -i 's/^\(listen_addresses\)/#\1/' postgresql.conf

# Change a value in key=value config
sed -i 's/^max_connections = .*/max_connections = 200/' postgresql.conf

# Add a line after a match (if not already present)
grep -q 'custom_setting' config.conf || sed -i '/\[server\]/a custom_setting = true' config.conf

# Replace between markers
sed -i '/# BEGIN MANAGED/,/# END MANAGED/{//!d}' config.conf
sed -i '/# BEGIN MANAGED/a new_line_1\nnew_line_2' config.conf

# Remove trailing whitespace from all lines
sed -i 's/[[:space:]]*$//' config.conf

# Replace in all .conf files recursively
find /etc/myapp/ -name '*.conf' -exec sed -i 's/old_host/new_host/g' {} +

Gotcha: Greedy vs Non-Greedy Matching

Remember: Greedy = match as much as possible. Non-greedy = match as little as possible. The mnemonic: .* is a hungry hippo that eats everything; [^X]* is a picky eater that stops at the first X.

# Input: <title>Hello World</title>
echo '<title>Hello World</title>' | grep -oE '<.*>'
# Output: <title>Hello World</title>   (matches everything!)

# Non-greedy: use negated character class (portable)
echo '<title>Hello World</title>' | grep -oE '<[^>]+>'
# Output:
# <title>
# </title>

# Non-greedy with PCRE
echo '<title>Hello World</title>' | grep -oP '<.*?>'
# Output:
# <title>
# </title>

The .* pattern is greedy by default — it matches as much as possible. In BRE/ERE there is no non-greedy quantifier. Use [^X]* (match anything except the delimiter) instead.
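
The same negated-class trick works in sed replacements — a sketch stripping tags:

```shell
# [^>]* stops at the first '>', so each tag is matched and removed separately
echo '<title>Hello World</title>' | sed -E 's/<[^>]*>//g'
# → Hello World
```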


Gotcha: In-Place sed Without Backup

# DANGEROUS: no recovery if the regex is wrong
sed -i 's/something/wrong/g' important.conf

# SAFE: creates important.conf.bak first
sed -i.bak 's/something/right/g' important.conf

# macOS sed requires an argument to -i (even empty string)
sed -i '' 's/old/new/g' file        # macOS
sed -i 's/old/new/g' file           # Linux

# Best practice: preview first, then apply
sed 's/old/new/g' file | diff file -
sed -i.bak 's/old/new/g' file

Gotcha: BRE vs ERE Confusion

This is the most common regex debugging time-sink:

# BRE (default grep/sed): group/repeat operators need escaping (\? and \+ are GNU extensions)
grep 'error\(s\)\?' file           # matches "error" or "errors"
sed 's/\([0-9]\+\)/[\1]/g' file   # wrap numbers in brackets

# ERE (grep -E / sed -E): NO escaping needed
grep -E 'errors?' file             # same, but readable
sed -E 's/([0-9]+)/[\1]/g' file   # same, but sane

# Common mistake: using ERE syntax without -E flag
grep 'error|fatal' file            # matches literal "error|fatal"
grep -E 'error|fatal' file         # matches "error" OR "fatal"

Rule of thumb: always use grep -E and sed -E unless you have a specific reason not to.


Gotcha: Locale Affecting Character Ranges

Debug clue: If your regex works on your laptop but not in CI or on a server, the locale is the first thing to check. Run locale on both machines. If they differ, prefix your command with LC_ALL=C to force byte-order comparison.

# This might NOT match only uppercase letters
grep '[A-Z]' file

# In some locales, [A-Z] matches a,B,c,D... (collation order)
# Use POSIX classes for safety
grep '[[:upper:]]' file

# Or force the C locale
LC_ALL=C grep '[A-Z]' file

This bites when you deploy a script written on your laptop (locale en_US.UTF-8) to a server with a different locale. Use [[:upper:]], [[:digit:]], etc., or set LC_ALL=C.


Pattern: Multi-Line Processing

# Join continuation lines (lines ending with \)
sed ':a;/\\$/N;s/\\\n//;ta' file

# Extract multi-line blocks (between START and END)
sed -n '/START/,/END/p' file

# awk: print from a matching line to the end of its paragraph (blank-line separated)
awk '/pattern/{found=1} found; /^$/{found=0}' file

# Combine every 2 lines into one
paste - - < file

# Process blocks separated by blank lines
awk 'BEGIN{RS=""} /pattern/' file

Pattern: Data Transformation Pipelines

# Convert timestamp formats in bulk
# From: 2024-03-15T14:30:00Z  To: 03/15/2024 14:30
sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})T([0-9]{2}:[0-9]{2}).*/\2\/\3\/\1 \4/' file
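
The sed above can only rearrange digits; for a month-name format like "Mar 15, 2024 14:30", GNU date can parse the ISO-8601 stamp directly — a sketch, where timestamps.txt is a hypothetical one-stamp-per-line file:

```shell
# GNU date parses ISO-8601; LC_ALL=C pins English month names regardless of locale
while IFS= read -r ts; do
  LC_ALL=C date -u -d "$ts" '+%b %d, %Y %H:%M'
done < timestamps.txt
```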

# Normalize whitespace (tabs to spaces, collapse multiple spaces)
tr '\t' ' ' < file | tr -s ' '

# Extract key-value pairs into clean format
grep -oE '[a-z_]+=[^ ]+' config.log | sort -t= -k1,1

# Convert single-line JSON to readable format (without jq)
sed 's/,/,\n/g; s/{/{\n/g; s/}/\n}/g' data.json

# Transpose rows and columns (GNU awk 4+ true 2-D arrays; assumes every row has the same field count)
awk '
{
  for (i=1; i<=NF; i++) {
    a[NR][i] = $i
  }
}
END {
  for (i=1; i<=NF; i++) {
    for (j=1; j<=NR; j++) {
      printf "%s ", a[j][i]
    }
    print ""
  }
}' data.txt

# Deduplicate lines preserving order (no sort required)
awk '!seen[$0]++' file
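
Why the dedup idiom works: seen[$0]++ returns the count *before* incrementing, so a first occurrence evaluates to !0 (true, print); every repeat is false. A quick demonstration:

```shell
# First occurrences print in their original order; repeats are suppressed
printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
# → a, b, c (one per line)
```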

Gotcha: Word Boundaries in Different Tools

# grep: \b works with -P (PCRE) but not always with -E
grep -P '\berror\b' file      # matches "error" not "errors"
grep -w 'error' file          # portable word matching

# sed: \b is GNU extension
sed -E 's/\berror\b/ERROR/g' file   # GNU sed only

# awk: use regex anchoring with field matching
awk '$0 ~ /\berror\b/' file         # varies by awk version

# Portable approach: use explicit boundaries
grep -E '(^|[^a-zA-Z])error([^a-zA-Z]|$)' file

Pattern: Combining Tools for Complex Extraction

# Extract all email addresses, sort by domain, count
grep -oEi '[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}' dump.txt \
  | awk -F@ '{print $2}' \
  | sort | uniq -c | sort -rn

# List function names defined in Python files (-h drops filename prefixes)
grep -hE '^ *def [A-Za-z_][A-Za-z0-9_]*\(' *.py | sed -E 's/.*def ([A-Za-z0-9_]+)\(.*/\1/' | sort -u

# Parse structured log: count occurrences of the first bracketed field (e.g. log level)
awk -F'[][]' '{print $2}' app.log | sort | uniq -c

# Compare two directory listings (find differences)
diff <(ls dir1/ | sort) <(ls dir2/ | sort)

# Generate a report from multiple files
for f in /var/log/app/*.log; do
  echo "=== $(basename "$f") ==="
  grep -c ERROR "$f"
done | paste - -