Portal | Level: L1: Foundations | Topics: Regex & Text Wrangling, Bash / Shell Scripting, Linux Fundamentals | Domain: CLI Tools
Regex & Text Wrangling - Primer¶
Why This Matters¶
Every infrastructure task eventually becomes a text-processing task. Logs are text. Configs are text. API responses are text. If you cannot slice, dice, and transform text fluently from the command line, you are copying things into spreadsheets — and that does not scale.
This primer covers regex syntax (and its three annoying dialects), sed for stream editing, awk as a full data-processing language, and grep for searching. Master these and you can extract answers from any log file, rewrite any config, and build data pipelines without writing a script.
Regex Fundamentals¶
The Three Dialects You Must Know¶
| Feature | BRE (Basic) | ERE (Extended) | PCRE (Perl) |
|---|---|---|---|
| Used by | grep, sed | grep -E, awk | grep -P, Perl |
| Grouping | \( \) | ( ) | ( ) |
| Alternation | \\\| | \| | \| |
| Quantifier + | \+ | + | + |
| Quantifier ? | \? | ? | ? |
| Quantifier {n,m} | \{n,m\} | {n,m} | {n,m} |
| Lookahead | No | No | (?=...), (?!...) |
| Non-greedy | No | No | *?, +? |
The escaping differences between BRE and ERE cause more wasted hours than any regex complexity. When in doubt, use ERE (grep -E or sed -E).
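To make the escaping difference concrete, here is a minimal sketch of the same two matches in both dialects (the \? and \| forms in BRE are GNU extensions; /tmp/animals.txt is a throwaway example file):

```shell
printf 'cat\ncats\ndog\n' > /tmp/animals.txt

# BRE (GNU grep): quantifiers and alternation need backslashes
grep 'cats\?' /tmp/animals.txt          # matches "cat" and "cats"
grep 'cat\|dog' /tmp/animals.txt        # matches all three lines

# ERE: the same patterns read the way most people expect
grep -E 'cats?' /tmp/animals.txt
grep -E 'cat|dog' /tmp/animals.txt
```

Same matches, but the ERE versions are what you will see in most documentation, which is one more reason to default to -E.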
Name origin: "Regular expressions" were invented by mathematician Stephen Kleene in 1951 as a notation for describing "regular languages" in formal language theory. Ken Thompson implemented them in the QED text editor in 1968 and then in ed and grep on Unix. The * (Kleene star) is named after Kleene.
Core Syntax Reference¶
. any single character
^ start of line
$ end of line
* zero or more of preceding
+ one or more of preceding (ERE/PCRE)
? zero or one of preceding (ERE/PCRE)
[abc] character class: a, b, or c
[^abc] negated class: not a, b, or c
[a-z] range: lowercase letters
\d digit (PCRE only; use [0-9] for portability)
\w word character (PCRE; use [a-zA-Z0-9_] for portability)
\s whitespace (PCRE; use [[:space:]] for portability)
\b word boundary (PCRE; also a GNU grep/sed extension)
(...) capture group (ERE/PCRE)
\1 backreference to first capture group
| alternation (ERE/PCRE)
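A quick sketch of capture groups with a backreference: flag lines containing a repeated word, a classic proofreading check. (Backreferences in grep -E are a GNU extension; POSIX ERE does not define them.)

```shell
# ([a-z]+) captures a word; \1 requires the same word to repeat
printf 'the the end\nall good\nso so tired\n' | grep -E '([a-z]+) \1'
# prints "the the end" and "so so tired"
```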
POSIX Character Classes (Portable)¶
[[:digit:]] = [0-9]
[[:alpha:]] = [a-zA-Z]
[[:alnum:]] = [a-zA-Z0-9]
[[:space:]] = whitespace (space, tab, newline)
[[:upper:]] = [A-Z]
[[:lower:]] = [a-z]
[[:punct:]] = punctuation
Use these in sed and awk for locale-safe matching.
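A small demonstration that the same portable class works across grep, sed, and tr (note that tr takes the class without the outer brackets):

```shell
# Lines containing any digit
printf 'order-42\nno digits here\nitem 7\n' | grep '[[:digit:]]'

# Strip all digits, two ways
echo 'a1b2c3' | sed 's/[[:digit:]]//g'     # abc
echo 'a1b2c3' | tr -d '[:digit:]'          # abc
```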
grep: Search and Filter¶
Essential Flags¶
# Extended regex (ERE) — avoids backslash hell
grep -E 'pattern' file
# Case insensitive
grep -i 'error' /var/log/syslog
# Line numbers
grep -n 'FATAL' app.log
# Count matches
grep -c 'error' app.log
# Invert match (lines that do NOT match)
grep -v '^#' config.conf
# Show context (3 lines before, 3 after)
grep -B3 -A3 'OOM' /var/log/kern.log
# Recursive search in directory
grep -r 'password' /etc/
# Only filenames
grep -rl 'TODO' src/
# Perl-compatible regex (lookbehind here; PCRE also adds lookahead, non-greedy)
grep -oP '(?<=user=)\w+' access.log
# Fixed string (no regex interpretation)
grep -F '192.168.1.1' access.log
grep Pipeline Patterns¶
# Find IPs in a log, sort, count unique
grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log | sort | uniq -c | sort -rn | head -20
# Find lines matching multiple patterns (AND logic)
grep 'error' app.log | grep 'database'
# Match any of several patterns (OR logic)
grep -E 'error|fatal|panic' app.log
# Exclude comment and empty lines from config
grep -Ev '^[[:space:]]*(#|$)' nginx.conf
sed: Stream Editing¶
Basic Substitution¶
# Replace first occurrence per line
sed 's/old/new/' file
# Replace ALL occurrences per line (global flag)
sed 's/old/new/g' file
# In-place edit (modifies the file)
sed -i 's/old/new/g' file
# In-place with backup
sed -i.bak 's/old/new/g' file
# Use different delimiter (useful when pattern contains /)
sed 's|/usr/local|/opt|g' file
# Case-insensitive substitution (GNU sed)
sed 's/error/WARNING/gi' file
Line Selection¶
# Only line 5
sed -n '5p' file
# Lines 10-20
sed -n '10,20p' file
# Lines matching pattern
sed -n '/ERROR/p' file
# Delete lines matching pattern
sed '/^#/d' config.conf
# Delete empty lines
sed '/^$/d' file
# Delete lines 1-5
sed '1,5d' file
Advanced sed¶
# Capture groups and backreferences
echo "2024-03-15" | sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3\/\2\/\1/'
# Output: 15/03/2024
# Insert line before match
sed '/\[server\]/i upstream_server = 10.0.0.1' config.ini
# Append line after match
sed '/\[server\]/a max_connections = 100' config.ini
# Replace between two patterns
sed '/START/,/END/s/foo/bar/g' file
# Multiple operations
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file
# Print only the replacement (combo with -n)
sed -n 's/.*user=\([^ ]*\).*/\1/p' access.log
awk: A Data Processing Language¶
Name origin: awk is named after its three creators: Alfred Aho, Peter Weinberger, and Brian Kernighan -- all at Bell Labs. First released in 1977. Aho later co-invented the Aho-Corasick string matching algorithm and co-authored the "Dragon Book" on compilers. sed stands for "stream editor" -- it reads input line by line and applies editing commands; it was designed by Lee McMahon at Bell Labs in 1974.
awk is not just a command — it is a programming language optimized for columnar text.
Basics¶
# Print second column (space-delimited)
awk '{print $2}' file
# Custom delimiter
awk -F: '{print $1, $3}' /etc/passwd
# Print last column
awk '{print $NF}' file
# Print second-to-last column
awk '{print $(NF-1)}' file
# With condition
awk '$3 > 100 {print $1, $3}' data.txt
# Pattern match
awk '/ERROR/ {print $0}' app.log
Built-in Variables¶
$0 entire line
$1 first field
$NF last field
NR current line number (across all files)
FNR current line number (current file)
NF number of fields in current line
FS field separator (input)
OFS output field separator
RS record separator
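The built-ins are easiest to see together in one pass. A sketch over two passwd-style records: NR numbers each line, NF counts its fields, and FS/OFS control the input and output delimiters.

```shell
printf 'root:x:0\ndaemon:x:1\n' \
  | awk 'BEGIN {FS=":"; OFS="\t"} {print NR, NF, $1, $NF}'
# 1	3	root	0
# 2	3	daemon	1
```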
awk as a Language¶
# Sum a column
awk '{sum += $3} END {print sum}' data.txt
# Average
awk '{sum += $3; n++} END {print sum/n}' data.txt
# Count occurrences by key
awk '{count[$1]++} END {for (k in count) print k, count[k]}' access.log
# Max value
awk 'BEGIN {max=0} $3 > max {max=$3; line=$0} END {print line}' data.txt
# Formatted output
awk '{printf "%-20s %10d\n", $1, $3}' data.txt
# Multiple field separators
awk -F'[: ]+' '{print $1, $2}' file
# BEGIN and END blocks
awk 'BEGIN {print "Name", "Size"} {print $9, $5} END {print "---done---"}' <(ls -l)
awk for Log Analysis¶
# Requests per hour from Apache log
awk '{print $4}' access.log | cut -d: -f2 | sort | uniq -c
# Average response time (assuming time is in column 10)
awk '{sum += $10; n++} END {printf "avg: %.2fms\n", sum/n}' access.log
# HTTP status code distribution
awk '{print $9}' access.log | sort | uniq -c | sort -rn
# Slow requests (> 1 second)
awk '$10 > 1000 {print $7, $10"ms"}' access.log
Text Processing Pipeline Building Blocks¶
The Essential Toolkit¶
grep filter lines
sed transform text
awk columnar processing
sort order lines
uniq deduplicate (requires sorted input)
cut extract columns (simpler than awk)
tr translate/delete characters
wc count lines/words/chars
tee fork output to file and stdout
paste merge lines side by side
comm compare sorted files
join join files on a common field
xargs build commands from input
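comm is the least familiar tool on that list, so here is a sketch with two hypothetical (already sorted) host inventories. comm prints three columns -- unique to file 1, unique to file 2, common -- and -1/-2/-3 suppress the corresponding column.

```shell
printf 'alpha\nbravo\ncharlie\n' > /tmp/old_hosts
printf 'bravo\ncharlie\ndelta\n' > /tmp/new_hosts

comm -13 /tmp/old_hosts /tmp/new_hosts   # only in new_hosts: delta
comm -23 /tmp/old_hosts /tmp/new_hosts   # only in old_hosts: alpha
comm -12 /tmp/old_hosts /tmp/new_hosts   # in both: bravo, charlie
```

If the inputs are not sorted, run them through sort first; comm and join both silently misbehave on unsorted input.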
Pipeline Examples¶
# Top 10 IP addresses in access log
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10
# Convert CSV to TSV
tr ',' '\t' < data.csv > data.tsv
# Remove duplicate lines (preserving order)
awk '!seen[$0]++' file
# ↑ This is the most elegant awk one-liner ever written.
# seen is an associative array, $0 is the current line,
# ++ returns the old value (0 for first occurrence) then increments.
# ! negates it: true for first occurrence, false for duplicates.
# Extract email addresses from text
grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' file
# Replace newlines with commas (join lines)
paste -sd, file
# Split a file every 1000 lines
split -l 1000 bigfile.log chunk_
# Process in parallel with xargs
cat urls.txt | xargs -P 10 -I {} curl -s {}
# Extract field from JSON lines (without jq)
grep -oP '"name"\s*:\s*"\K[^"]+' data.json
# Total line count across all Python files (tail -1 grabs wc's "total" line;
# on very large trees xargs may split into batches, skewing that total)
find . -name '*.py' | xargs wc -l | tail -1
Practical Cheat Sheet¶
| Task | Command |
|---|---|
| Filter lines | grep -E 'pattern' file |
| Invert filter | grep -v 'pattern' file |
| Replace text | sed 's/old/new/g' file |
| Delete lines | sed '/pattern/d' file |
| Extract column | awk '{print $N}' file |
| Sum column | awk '{s+=$N} END {print s}' file |
| Count by group | awk '{a[$1]++} END {for (k in a) print k, a[k]}' |
| Sort + unique count | sort \| uniq -c \| sort -rn |
| Remove blank lines | sed '/^$/d' file |
| Strip comments | grep -v '^[[:space:]]*#' file |
| Convert delimiters | tr ',' '\t' or awk -F, -v OFS='\t' '{$1=$1}1' |
Remember the tool-selection mnemonic: "Grep Grabs, Sed Substitutes, Awk Analyzes."
grep filters lines (pattern matching), sed transforms text (substitution/deletion), awk processes structured data (columns, math, aggregation). When you need all three, pipe them: grep to filter, sed to clean, awk to compute.
Understanding these tools is not about memorizing syntax — it is about knowing which tool to reach for. grep filters, sed transforms, awk computes. Pipes connect them. That mental model covers 90% of text wrangling.
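That mental model in one pipe, over a fabricated four-line mini-log: grep filters to ERROR lines, sed strips the date prefix, awk aggregates failures per component.

```shell
printf '%s\n' \
  '2024-03-15 ERROR db timeout' \
  '2024-03-15 INFO ok' \
  '2024-03-15 ERROR db locked' \
  '2024-03-15 ERROR web 502' \
  | grep 'ERROR' \
  | sed -E 's/^[0-9-]+ //' \
  | awk '{count[$2]++} END {for (k in count) print k, count[k]}'
# db 2 and web 1 (awk's for-in order is unspecified, so sort if you need stable output)
```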
Wiki Navigation¶
Prerequisites¶
- Advanced Bash for Ops (Topic Pack, L1)
Related Content¶
- Advanced Bash for Ops (Topic Pack, L1) — Bash / Shell Scripting, Linux Fundamentals
- Bash Exercises (Quest Ladder) (CLI) (Exercise Set, L0) — Bash / Shell Scripting, Linux Fundamentals
- Environment Variables (Topic Pack, L1) — Bash / Shell Scripting, Linux Fundamentals
- LPIC / LFCS Exam Preparation (Topic Pack, L2) — Bash / Shell Scripting, Linux Fundamentals
- Linux Ops (Topic Pack, L0) — Bash / Shell Scripting, Linux Fundamentals
- Linux Ops Drills (Drill, L0) — Bash / Shell Scripting, Linux Fundamentals
- Pipes & Redirection (Topic Pack, L1) — Bash / Shell Scripting, Linux Fundamentals
- Process Management (Topic Pack, L1) — Bash / Shell Scripting, Linux Fundamentals
- RHCE (EX294) Exam Preparation (Topic Pack, L2) — Bash / Shell Scripting, Linux Fundamentals
- Track: Foundations (Reference, L0) — Bash / Shell Scripting, Linux Fundamentals