Regex & Text Wrangling - Street-Level Ops¶
Quick Diagnosis Commands¶
When you need to extract data from logs or configs right now:
# Find all unique error messages in a log (the I flag makes the s/// match case-insensitive, GNU sed)
grep -i 'error' /var/log/app.log | sed 's/^.*ERROR //I' | sort -u
# Count requests per HTTP status code
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn
# Show the last 100 matching lines (pre-trim with tail so grep doesn't scan the whole file)
tail -10000 app.log | grep -E 'WARN|ERROR|FATAL' | tail -100
# Find which config files contain a specific setting
grep -rl 'max_connections' /etc/
# Extract and count unique IP addresses from logs
awk '{print $1}' access.log | sort -u | wc -l
Pattern: Log Parsing One-Liners¶
Apache/Nginx Access Logs¶
# Top 20 IPs by request count
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -20
# Requests per minute (for rate analysis)
awk '{print $4}' access.log | cut -d: -f1-3 | uniq -c
# All 5xx errors with full URL
awk '$9 ~ /^5/ {print $9, $7}' access.log
# Bandwidth per URL (sum bytes transferred)
awk '{bytes[$7] += $10} END {for (url in bytes) printf "%10d %s\n", bytes[url], url}' access.log | sort -rn | head -20
# Slow requests (response time > 2s, if logged as last field)
awk '$NF > 2.0 {print $NF, $7}' access.log | sort -rn | head -20
# Extract query strings
grep -oP '\?[^ "]*' access.log | sort | uniq -c | sort -rn | head -20
Syslog / journald¶
# Failed SSH login attempts with usernames
grep 'Failed password' /var/log/auth.log | awk '{print $(NF-5)}' | sort | uniq -c | sort -rn
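The $(NF-5) trick assumes the standard "Failed password for USER from IP port N ssh2" layout; "invalid user" lines add two fields and shift the count. A sketch of a PCRE alternative (GNU grep) that handles both forms -- \K discards everything matched before it from the output:

```shell
# Handles both "for root from ..." and "for invalid user admin from ..."
grep -oP 'Failed password for (invalid user )?\K\S+' /var/log/auth.log \
  | sort | uniq -c | sort -rn
```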
# Sudo commands by user
grep 'sudo:' /var/log/auth.log | sed -n 's/.*sudo:\s*\(\S*\).*/\1/p' | sort | uniq -c
# Service restarts in the last hour
journalctl --since "1 hour ago" | grep -i 'started\|stopped\|restart'
Pattern: CSV Wrangling Without a Spreadsheet¶
# Print specific columns (1st and 3rd) from CSV
awk -F, '{print $1, $3}' data.csv
# Filter rows where column 3 > 100
awk -F, '$3 > 100' data.csv
# Sum column 4
awk -F, '{sum += $4} END {print sum}' data.csv
# Convert CSV to TSV
sed 's/,/\t/g' data.csv > data.tsv
# Add header to output
awk -F, 'NR==1 || $3 > 100' data.csv
# Handle quoted CSV fields (use proper tool for complex CSV)
# For simple cases without embedded commas:
awk -F, '{gsub(/"/, ""); print $2}' data.csv
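When fields contain embedded commas inside quotes, the "proper tool" can still be a one-liner -- a sketch using Python's stdlib csv module (r[1] is the second column; adjust the index and filename to taste):

```shell
# Correctly handles quoting, embedded commas, and escaped quotes
python3 -c 'import csv,sys; [print(r[1]) for r in csv.reader(sys.stdin)]' < data.csv
```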
# Join two CSVs on first column (files must be sorted)
join -t, file1.csv file2.csv
# Pivot: count occurrences of values in column 2
awk -F, '{count[$2]++} END {for (k in count) print k "," count[k]}' data.csv
Pattern: Config File Surgery¶
Gotcha:
sed -i on a symlink replaces the symlink with a regular file. The original file (what the symlink pointed to) is unchanged, and any other symlinks to it are now orphaned. Use sed -i.bak and verify, or dereference the link first (GNU sed has --follow-symlinks).
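A minimal demonstration in a throwaway directory (filenames are illustrative):

```shell
cd "$(mktemp -d)"
echo 'key = 1' > real.conf
ln -s real.conf link.conf
sed -i 's/1/2/' link.conf     # GNU sed renames a temp file over link.conf
test -L link.conf || echo 'link.conf is now a regular file'
cat real.conf                 # still "key = 1" -- the target was never touched
```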
# Uncomment a line
sed -i 's/^#\(max_connections = 100\)/\1/' postgresql.conf
# Comment out a line
sed -i 's/^\(listen_addresses\)/#\1/' postgresql.conf
# Change a value in key=value config
sed -i 's/^max_connections = .*/max_connections = 200/' postgresql.conf
# Add a line after a match (if not already present)
grep -q 'custom_setting' config.conf || sed -i '/\[server\]/a custom_setting = true' config.conf
# Replace between markers
sed -i '/# BEGIN MANAGED/,/# END MANAGED/{//!d}' config.conf
sed -i '/# BEGIN MANAGED/a new_line_1\nnew_line_2' config.conf
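A worked example of the two commands above, run against a scratch file with the same markers (GNU sed; the \n inside the append text is a GNU extension):

```shell
cd "$(mktemp -d)"
printf '%s\n' 'keep_top' '# BEGIN MANAGED' 'old_a' 'old_b' '# END MANAGED' 'keep_bottom' > config.conf
sed -i '/# BEGIN MANAGED/,/# END MANAGED/{//!d}' config.conf     # empty the block, keep the markers
sed -i '/# BEGIN MANAGED/a new_line_1\nnew_line_2' config.conf   # refill it
cat config.conf
```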
# Remove trailing whitespace from all lines
sed -i 's/[[:space:]]*$//' config.conf
# Replace in all .conf files recursively
find /etc/myapp/ -name '*.conf' -exec sed -i 's/old_host/new_host/g' {} +
Gotcha: Greedy vs Non-Greedy Matching¶
Remember: Greedy = match as much as possible. Non-greedy = match as little as possible. The mnemonic: .* is a hungry hippo that eats everything; [^X]* is a picky eater that stops at the first X.
# Input: <title>Hello World</title>
echo '<title>Hello World</title>' | grep -oE '<.*>'
# Output: <title>Hello World</title> (matches everything!)
# Non-greedy: use negated character class (portable)
echo '<title>Hello World</title>' | grep -oE '<[^>]+>'
# Output:
# <title>
# </title>
# Non-greedy with PCRE
echo '<title>Hello World</title>' | grep -oP '<.*?>'
# Output:
# <title>
# </title>
The .* pattern is greedy by default — it matches as much as possible. In BRE/ERE there is no non-greedy quantifier. Use [^X]* (match anything except the delimiter) instead.
Gotcha: In-Place sed Without Backup¶
# DANGEROUS: no recovery if the regex is wrong
sed -i 's/something/wrong/g' important.conf
# SAFE: creates important.conf.bak first
sed -i.bak 's/something/right/g' important.conf
# macOS sed requires an argument to -i (even empty string)
sed -i '' 's/old/new/g' file # macOS
sed -i 's/old/new/g' file # Linux
# Best practice: preview first, then apply
sed 's/old/new/g' file | diff file -
sed -i.bak 's/old/new/g' file
Gotcha: BRE vs ERE Confusion¶
This is the most common regex debugging time-sink:
# BRE (default grep/sed): special chars need escaping
grep 'error\(s\)\?' file # matches "error" or "errors"
sed 's/\([0-9]\+\)/[\1]/g' file # wrap numbers in brackets
# ERE (grep -E / sed -E): NO escaping needed
grep -E 'errors?' file # same, but readable
sed -E 's/([0-9]+)/[\1]/g' file # same, but sane
# Common mistake: using ERE syntax without -E flag
grep 'error|fatal' file # matches literal "error|fatal"
grep -E 'error|fatal' file # matches "error" OR "fatal"
Rule of thumb: always use grep -E and sed -E unless you have a specific reason not to.
Gotcha: Locale Affecting Character Ranges¶
Debug clue: If your regex works on your laptop but not in CI or on a server, the locale is the first thing to check. Run locale on both machines. If they differ, prefix your command with LC_ALL=C to force plain byte-value ordering.
# This might NOT match only uppercase letters
grep '[A-Z]' file
# In some locales, [A-Z] matches a,B,c,D... (collation order)
# Use POSIX classes for safety
grep '[[:upper:]]' file
# Or force the C locale
LC_ALL=C grep '[A-Z]' file
This bites when you deploy a script written on your laptop (locale en_US.UTF-8) to a server with a different locale. Use [[:upper:]], [[:digit:]], etc., or set LC_ALL=C.
Pattern: Multi-Line Processing¶
# Join continuation lines (lines ending with \)
sed ':a;/\\$/N;s/\\\n//;ta' file
# Extract multi-line blocks (between START and END)
sed -n '/START/,/END/p' file
# awk: print from a matching line through the next blank line
awk '/pattern/{found=1} found; /^$/{found=0}' file
# Combine every 2 lines into one
paste - - < file
# Process blocks separated by blank lines
awk 'BEGIN{RS=""} /pattern/' file
Pattern: Data Transformation Pipelines¶
# Convert timestamp formats in bulk
# From: 2024-03-15T14:30:00Z To: 03/15/2024 14:30
sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})T([0-9]{2}:[0-9]{2}).*/\2\/\3\/\1 \4/' file
# Normalize whitespace (tabs to spaces, collapse multiple spaces)
tr '\t' ' ' < file | tr -s ' '
# Extract key-value pairs into clean format
grep -oE '[a-z_]+=[^ ]+' config.log | sort -t= -k1,1
# Convert single-line JSON to readable format (without jq)
sed 's/,/,\n/g; s/{/{\n/g; s/}/\n}/g' data.json
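The sed hack breaks on commas and braces inside string values; if Python is installed, its stdlib formats JSON correctly:

```shell
# Reads a file argument or stdin; fails loudly on invalid JSON
python3 -m json.tool data.json
```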
# Transpose rows and columns (simple case; a[NR][i] arrays need GNU awk 4+)
awk '
{
for (i=1; i<=NF; i++) {
a[NR][i] = $i
}
}
END {
for (i=1; i<=NF; i++) {
for (j=1; j<=NR; j++) {
printf "%s ", a[j][i]
}
print ""
}
}' data.txt
# Deduplicate lines preserving order (no sort required)
awk '!seen[$0]++' file
Gotcha: Word Boundaries in Different Tools¶
# grep: \b works with -P (PCRE) but not always with -E
grep -P '\berror\b' file # matches "error" not "errors"
grep -w 'error' file # portable word matching
# sed: \b is GNU extension
sed -E 's/\berror\b/ERROR/g' file # GNU sed only
# awk: \b means backspace inside a gawk regex; the word-boundary operator is \y
awk '$0 ~ /\yerror\y/' file # GNU awk only
# Portable approach: use explicit boundaries
grep -E '(^|[^a-zA-Z])error([^a-zA-Z]|$)' file
Pattern: Combining Tools for Complex Extraction¶
# Extract all email addresses, sort by domain, count
grep -oEi '[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}' dump.txt \
| awk -F@ '{print $2}' \
| sort | uniq -c | sort -rn
# Find functions defined in all Python files
grep -rn 'def [a-z_]*(' *.py | sed -E 's/.*def ([a-z_]+)\(.*/\1/' | sort -u
# Parse structured log: count occurrences of the first [bracketed] field (e.g. the level)
awk -F'[][]' '{print $2}' app.log | sort | uniq -c
# Compare two directory listings (find differences)
diff <(ls dir1/ | sort) <(ls dir2/ | sort)
# Generate a report from multiple files
for f in /var/log/app/*.log; do
echo "=== $(basename "$f") ==="
grep -c ERROR "$f"
done | paste - -