
Portal | Level: L1: Foundations | Topics: Regex & Text Wrangling, Bash / Shell Scripting, Linux Fundamentals | Domain: CLI Tools

Regex & Text Wrangling - Primer

Why This Matters

Every infrastructure task eventually becomes a text-processing task. Logs are text. Configs are text. API responses are text. If you cannot slice, dice, and transform text fluently from the command line, you are copying things into spreadsheets — and that does not scale.

This primer covers regex syntax (and its three annoying dialects), sed for stream editing, awk as a full data-processing language, and grep for searching. Master these and you can extract answers from any log file, rewrite any config, and build data pipelines without writing a script.


Regex Fundamentals

The Three Dialects You Must Know

Feature            BRE (Basic)       ERE (Extended)    PCRE (Perl)
Used by            grep, sed         grep -E, awk      grep -P, Perl
Grouping           \( \)             ( )               ( )
Alternation        \| (GNU ext.)     |                 |
Quantifier +       \+ (GNU ext.)     +                 +
Quantifier ?       \? (GNU ext.)     ?                 ?
Quantifier {n,m}   \{n,m\}           {n,m}             {n,m}
Lookahead          No                No                (?=...), (?!...)
Non-greedy         No                No                *?, +?

Note: POSIX BRE has no alternation and no + or ? quantifiers at all; \|, \+, and \? are GNU extensions.

The escaping differences between BRE and ERE cause more wasted hours than any regex complexity. When in doubt, use ERE (grep -E or sed -E).
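For instance, the same optional-character match needs an escape in GNU BRE but not in ERE:

```shell
# BRE: GNU grep accepts \? for "zero or one" (a GNU extension; POSIX BRE has none)
printf 'color\ncolour\ncolr\n' | grep 'colou\?r'
# ERE: plain ? works, no backslash needed
printf 'color\ncolour\ncolr\n' | grep -E 'colou?r'
# Both print: color, colour
```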

Name origin: "Regular expressions" were invented by mathematician Stephen Kleene in 1951 as a notation for describing "regular languages" in formal language theory. Ken Thompson implemented them in the QED text editor in 1968 and then in ed and grep on Unix. The * (Kleene star) is named after Kleene.

Core Syntax Reference

.        any single character
^        start of line
$        end of line
*        zero or more of preceding
+        one or more of preceding (ERE/PCRE)
?        zero or one of preceding (ERE/PCRE)
[abc]    character class: a, b, or c
[^abc]   negated class: not a, b, or c
[a-z]    range: lowercase letters
\d       digit (PCRE only; use [0-9] for portability)
\w       word character (PCRE; use [a-zA-Z0-9_] for portability)
\s       whitespace (PCRE; use [[:space:]] for portability)
\b       word boundary (PCRE; also supported as a GNU grep/sed extension)
(...)    capture group (ERE/PCRE)
\1       backreference to first capture group
|        alternation (ERE/PCRE)
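A quick illustration of groups and backreferences, using ERE (assuming GNU grep and sed, which support backreferences in -E mode):

```shell
# ([a-z]+) captures a word; \1 must then match the exact same text again
echo "the the quick fox" | grep -E '([a-z]+) \1'     # matches the doubled word
# sed can reuse the capture in the replacement to collapse the repeat
echo "the the quick fox" | sed -E 's/([a-z]+) \1/\1/'
# Output: the quick fox
```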

POSIX Character Classes (Portable)

[[:digit:]]   = [0-9]
[[:alpha:]]   = [a-zA-Z]
[[:alnum:]]   = [a-zA-Z0-9]
[[:space:]]   = whitespace (space, tab, newline)
[[:upper:]]   = [A-Z]
[[:lower:]]   = [a-z]
[[:punct:]]   = punctuation

Use these in sed and awk for locale-safe matching.
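Two small examples of POSIX classes in practice (these behave the same under any locale):

```shell
# Keep only digits: -c complements the set, -d deletes everything in it
echo "port: 8080 (tcp)" | tr -cd '[:digit:]'; echo
# 8080

# Squeeze any run of whitespace (spaces, tabs) down to a single space
printf 'a   b\tc\n' | sed -E 's/[[:space:]]+/ /g'
# a b c
```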


grep: Search and Filter

Essential Flags

# Extended regex (ERE) — avoids backslash hell
grep -E 'pattern' file

# Case insensitive
grep -i 'error' /var/log/syslog

# Line numbers
grep -n 'FATAL' app.log

# Count matches
grep -c 'error' app.log

# Invert match (lines that do NOT match)
grep -v '^#' config.conf

# Show context (3 lines before, 3 after)
grep -B3 -A3 'OOM' /var/log/kern.log

# Recursive search in directory
grep -r 'password' /etc/

# Only filenames
grep -rl 'TODO' src/

# Perl-compatible regex (lookbehind here; PCRE also adds lookahead and non-greedy)
grep -P '(?<=user=)\w+' access.log

# Fixed string (no regex interpretation)
grep -F '192.168.1.1' access.log

grep Pipeline Patterns

# Find IPs in a log, sort, count unique
grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log | sort | uniq -c | sort -rn | head -20

# Find lines matching multiple patterns (AND logic)
grep 'error' app.log | grep 'database'

# Match any of several patterns (OR logic)
grep -E 'error|fatal|panic' app.log

# Exclude comment and empty lines from config
grep -Ev '^\s*(#|$)' nginx.conf

sed: Stream Editing

Basic Substitution

# Replace first occurrence per line
sed 's/old/new/' file

# Replace ALL occurrences per line (global flag)
sed 's/old/new/g' file

# In-place edit (modifies the file)
sed -i 's/old/new/g' file

# In-place with backup
sed -i.bak 's/old/new/g' file

# Use different delimiter (useful when pattern contains /)
sed 's|/usr/local|/opt|g' file

# Case-insensitive substitution (GNU sed)
sed 's/error/WARNING/gi' file

Line Selection

# Only line 5
sed -n '5p' file

# Lines 10-20
sed -n '10,20p' file

# Lines matching pattern
sed -n '/ERROR/p' file

# Delete lines matching pattern
sed '/^#/d' config.conf

# Delete empty lines
sed '/^$/d' file

# Delete lines 1-5
sed '1,5d' file

Advanced sed

# Capture groups and backreferences
echo "2024-03-15" | sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3\/\2\/\1/'
# Output: 15/03/2024

# Insert line before match
sed '/\[server\]/i upstream_server = 10.0.0.1' config.ini

# Append line after match
sed '/\[server\]/a max_connections = 100' config.ini

# Replace between two patterns
sed '/START/,/END/s/foo/bar/g' file

# Multiple operations
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file

# Print only the replacement (combo with -n)
sed -n 's/.*user=\([^ ]*\).*/\1/p' access.log

awk: A Data Processing Language

Name origin: awk is named after its three creators, Alfred Aho, Peter Weinberger, and Brian Kernighan, all at Bell Labs; it was first released in 1977. Aho later co-invented the Aho-Corasick string-matching algorithm and co-authored the "Dragon Book" on compilers. (sed, for its part, stands for "stream editor": it reads input line by line and applies editing commands, and was designed by Lee McMahon at Bell Labs in 1974.)

awk is not just a command — it is a programming language optimized for columnar text.

Basics

# Print second column (space-delimited)
awk '{print $2}' file

# Custom delimiter
awk -F: '{print $1, $3}' /etc/passwd

# Print last column
awk '{print $NF}' file

# Print second-to-last column
awk '{print $(NF-1)}' file

# With condition
awk '$3 > 100 {print $1, $3}' data.txt

# Pattern match
awk '/ERROR/ {print $0}' app.log

Built-in Variables

$0    entire line
$1    first field
$NF   last field
NR    current line number (across all files)
FNR   current line number (current file)
NF    number of fields in current line
FS    field separator (input)
OFS   output field separator
RS    record separator
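A small demonstration of several of these at once, on colon-delimited input (the sample data is made up):

```shell
# -F sets FS (the input separator); -v OFS sets the output separator.
# NR is the line number, NF the field count, $NF the last field.
printf 'root:x:0\ndaemon:x:1\n' | awk -F: -v OFS=' | ' '{print NR, NF, $1, $NF}'
# 1 | 3 | root | 0
# 2 | 3 | daemon | 1
```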

awk as a Language

# Sum a column
awk '{sum += $3} END {print sum}' data.txt

# Average
awk '{sum += $3; n++} END {print sum/n}' data.txt

# Count occurrences by key
awk '{count[$1]++} END {for (k in count) print k, count[k]}' access.log

# Max value
awk 'BEGIN {max=0} $3 > max {max=$3; line=$0} END {print line}' data.txt

# Formatted output
awk '{printf "%-20s %10d\n", $1, $3}' data.txt

# Multiple field separators
awk -F'[: ]+' '{print $1, $2}' file

# BEGIN and END blocks
awk 'BEGIN {print "Name", "Size"} {print $9, $5} END {print "---done---"}' <(ls -l)

awk for Log Analysis

# Requests per hour from Apache log
awk '{print $4}' access.log | cut -d: -f2 | sort | uniq -c

# Average response time (assuming time is in column 10)
awk '{sum += $10; n++} END {printf "avg: %.2fms\n", sum/n}' access.log

# HTTP status code distribution
awk '{print $9}' access.log | sort | uniq -c | sort -rn

# Slow requests (> 1 second)
awk '$10 > 1000 {print $7, $10"ms"}' access.log

Text Processing Pipeline Building Blocks

The Essential Toolkit

grep     filter lines
sed      transform text
awk      columnar processing
sort     order lines
uniq     deduplicate (requires sorted input)
cut      extract columns (simpler than awk)
tr       translate/delete characters
wc       count lines/words/chars
tee      fork output to file and stdout
paste    merge lines side by side
comm     compare sorted files
join     join files on a common field
xargs    build commands from input

Pipeline Examples

# Top 10 IP addresses in access log
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10

# Convert CSV to TSV
tr ',' '\t' < data.csv > data.tsv

# Remove duplicate lines (preserving order)
awk '!seen[$0]++' file
# ↑ This is the most elegant awk one-liner ever written.
#   seen is an associative array, $0 is the current line,
#   ++ returns the old value (0 for first occurrence) then increments.
#   ! negates it: true for first occurrence, false for duplicates.

# Extract email addresses from text
grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' file

# Replace newlines with commas (join lines)
paste -sd, file

# Split a file every 1000 lines
split -l 1000 bigfile.log chunk_

# Process in parallel with xargs
cat urls.txt | xargs -P 10 -I {} curl -s {}

# Extract field from JSON lines (without jq)
grep -oP '"name"\s*:\s*"\K[^"]+' data.json

# Total lines across all .py files (-print0/-0 is safe for spaces in filenames)
find . -name '*.py' -print0 | xargs -0 wc -l | tail -1
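comm and join from the toolkit table deserve a quick sketch too; both need sorted input, and the bash process substitutions below stand in for real files:

```shell
# comm: column 1 = only in file1, column 2 = only in file2, column 3 = common.
# -12 suppresses columns 1 and 2, leaving only the shared lines.
comm -12 <(printf 'a\nb\nc\n') <(printf 'b\nc\nd\n')
# b
# c

# join: merge two files on a common key (field 1 by default)
join <(printf '1 alice\n2 bob\n') <(printf '1 alice@example.com\n2 bob@example.com\n')
# 1 alice alice@example.com
# 2 bob bob@example.com
```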

Practical Cheat Sheet

Task                          Command
────────────────────────────  ─────────────────────────────────────
Filter lines                  grep -E 'pattern' file
Invert filter                 grep -v 'pattern' file
Replace text                  sed 's/old/new/g' file
Delete lines                  sed '/pattern/d' file
Extract column                awk '{print $N}' file
Sum column                    awk '{s+=$N}END{print s}' file
Count by group                awk '{a[$1]++}END{for(k in a)print k,a[k]}'
Sort + unique count           sort | uniq -c | sort -rn
Remove blank lines            sed '/^$/d' file
Strip comments                grep -v '^\s*#' file
Convert delimiters            tr ',' '\t' or awk -F, -v OFS='\t' '{$1=$1}1'

Remember the tool-selection mnemonic: "Grep Grabs, Sed Substitutes, Awk Analyzes." grep filters lines (pattern matching), sed transforms text (substitution/deletion), awk processes structured data (columns, math, aggregation). When you need all three, pipe them: grep to filter, sed to clean, awk to compute.
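As a worked example of that filter-clean-compute pattern, suppose app.log contains lines like `2024-03-15 ERROR db=users latency=1200ms` (the field names here are invented for illustration):

```shell
# grep filters to errors, sed strips the "ms" unit, awk averages the latency field
grep 'ERROR' app.log \
  | sed -E 's/ms$//' \
  | awk -F'latency=' '{sum += $2; n++} END {if (n) printf "avg error latency: %.1fms\n", sum/n}'
```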

Understanding these tools is not about memorizing syntax — it is about knowing which tool to reach for. grep filters, sed transforms, awk computes. Pipes connect them. That mental model covers 90% of text wrangling.

