Text Processing: jq, awk, and sed in the Trenches
- lesson
- jq
- awk
- sed
- log-analysis
- json-processing
- pipelines
- kubectl
- monitoring-data

# Text Processing — jq, awk, and sed in the Trenches

Topics: jq, awk, sed, log analysis, JSON processing, pipelines, kubectl, monitoring data
Level: L1–L2 (Foundations → Operations)
Time: 60–80 minutes
Strategy: Parallel + build-up
The Mission¶
It's 4:17 PM on a Friday. Monitoring shows a traffic spike — 10x normal request volume started 45 minutes ago, and your API latency is climbing. Your manager asks the question nobody wants to hear before the weekend: "Can you figure out where the traffic is coming from before we decide whether to page the on-call team?"
You have a 2 GB access log on the production host. Some of your services log JSON, some log the traditional Apache Combined format. The Kubernetes cluster is serving multiple microservices and you need to cross-reference pod health with the log data.
Your tools: jq, awk, sed, and a terminal. No Splunk, no Datadog — the observability
stack is lagging behind the spike. It's just you and the command line.
Part 1: The Right Tool for the Shape of the Data¶
Before you touch the keyboard, the first question is: what format is this data in?
| Data shape | Reach for | Why |
|---|---|---|
| JSON (APIs, kubectl, structured logs) | jq | Understands nesting, arrays, types |
| Columnar text (access logs, TSV, ps output) | awk | Built-in field splitting, math, aggregation |
| Quick find-and-replace, line surgery | sed | Stream substitution, address ranges |
| Mix of formats in a pipeline | All three | Each handles one stage |
Mental Model: Think of it as three specialists in an ER. sed is the triage nurse — fast assessment, quick fixes, move on. awk is the radiologist — sees structure in everything, counts and measures. jq is the specialist — deep understanding of one complex format (JSON), and nobody else can do what it does.
Part 2: awk — "For Each Line, If Pattern Matches, Do Action"¶
Your first log is a traditional nginx access log. Two gigabytes of lines that look like this:
10.0.1.47 - - [22/Mar/2026:15:32:01 +0000] "GET /api/v2/products HTTP/1.1" 200 1432 "-" "python-requests/2.28.0"
Name Origin: awk is named after its three creators at Bell Labs: Alfred Aho, Peter Weinberger, and Brian Kernighan. Created in 1977. Aho was a formal languages expert (he also created egrep), Weinberger worked on databases, and Kernighan co-authored The C Programming Language. The name is literally their initials.
The mental model¶
Every awk program boils down to one sentence:
"For each line, if the pattern matches, do the action."
That's it. awk reads one line, checks the pattern, runs the action, moves to the next line.
No pattern means "every line." No action means "print the line." Let's use this to find
where your traffic spike is coming from.
Find the top talkers¶
# Count requests per IP address
awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' access.log \
| sort -rn | head -20
Break that down piece by piece:
| Fragment | What it does |
|---|---|
| `count[$1]++` | Uses an associative array keyed by field 1 (the IP). Increments on each hit. |
| `END { ... }` | Runs once after all input is processed. |
| `for (ip in count)` | Iterates every key in the array. |
| `print count[ip], ip` | Outputs the count and the IP. |
| `sort -rn \| head -20` | Sorts numerically, descending. Top 20. |
You run it. The sorted counts come back in under a minute: two IPs account for over a million requests. That's your spike.
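The counting pattern is easy to sanity-check in miniature. Below, three fabricated log lines (invented IPs and paths) stand in for the 2 GB file:

```shell
# Tiny stand-in for access.log: two hits from one IP, one from another
printf '%s\n' \
  '10.0.5.200 - - [22/Mar/2026:15:32:01 +0000] "GET /api/v2/products HTTP/1.1" 200 1432' \
  '10.0.5.200 - - [22/Mar/2026:15:32:02 +0000] "GET /api/v2/products HTTP/1.1" 200 1432' \
  '10.0.1.47 - - [22/Mar/2026:15:32:03 +0000] "GET /health HTTP/1.1" 200 12' \
| awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' \
| sort -rn
# 2 10.0.5.200
# 1 10.0.1.47
```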
Under the Hood: awk processes input in a single pass, reading one record at a time. It never holds the entire file in memory (unless you store every line in an array). This is why that command just chewed through 2 GB in under a minute using 800 KB of RAM. A Python script doing `readlines()` on the same file would consume gigabytes.
Narrow it down: what are they hitting?¶
# Requests per endpoint for the suspicious IPs
awk '$1 == "10.0.5.200" { split($7, path, "?"); endpoints[path[1]]++ }
END { for (e in endpoints) print endpoints[e], e }' access.log \
| sort -rn | head -10
They're hammering /api/v2/products. Now you know the what and the who.
Time-based analysis: when did it start?¶
# Requests per minute from the suspicious IP
awk '$1 == "10.0.5.200" {
# Extract timestamp: [22/Mar/2026:15:32:01
match($4, /\[(.+):(..):(..):/, t)
minute = t[1] ":" t[2] ":" t[3]
buckets[minute]++
} END {
for (m in buckets) print m, buckets[m]
}' access.log | sort | tail -20
Gotcha: awk's `match()` with capture groups is a GNU awk (gawk) extension. On systems with mawk (Debian/Ubuntu default), you'd need `substr()` and `index()` instead, or switch to `gawk` explicitly. Check which you have: `awk --version 2>/dev/null || awk -V`.
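A mawk-portable variant of the per-minute bucketing, sketched here against one fabricated log line, sidesteps `match()` captures entirely by exploiting the timestamp's fixed width:

```shell
# Portable rewrite: $4 always looks like [22/Mar/2026:15:32:01,
# so characters 2-18 are the date down to the minute.
printf '%s\n' \
  '10.0.5.200 - - [22/Mar/2026:15:32:01 +0000] "GET /api/v2/products HTTP/1.1" 200 1432' \
| awk '$1 == "10.0.5.200" {
    minute = substr($4, 2, 17)   # skip the "[", take dd/Mon/yyyy:HH:MM
    buckets[minute]++
  } END {
    for (m in buckets) print m, buckets[m]
  }'
# 22/Mar/2026:15:32 1
```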
awk's built-in variables — the cheat code¶
| Variable | Meaning | Use case |
|---|---|---|
| `$0` | Entire line | Print the whole thing |
| `$1`, `$2`, ... `$NF` | Fields (1-indexed) | Extract specific columns |
| `NR` | Total records read so far | Line numbers, skip headers |
| `NF` | Number of fields in current line | Get last field: `$NF` |
| `FS` | Input field separator | `-F:` or `BEGIN { FS=":" }` |
| `OFS` | Output field separator | `BEGIN { OFS="\t" }` for TSV |
# Skip the header line of a CSV
awk -F, 'NR > 1 { print $2 }' data.csv
# Print line numbers alongside content
awk '{ print NR, $0 }' file.txt
# Print the last field of every line (useful when you don't know the column count)
awk '{ print $NF }' mystery.log
Trivia: awk was created in 1977 — it predates Perl (1987), Python (1991), and Ruby (1995). Despite being nearly 50 years old, it ships on every Unix system and its core syntax is POSIX-standardized. The skill never expires.
Flashcard Check: awk¶
Cover the right column. Test yourself.
| Question | Answer |
|---|---|
| What does `$NF` give you? | The last field on the current line. |
| How do you set a colon delimiter? | `awk -F:` or `BEGIN { FS=":" }` |
| What runs before any input is read? | The `BEGIN` block. |
| What does `awk '!seen[$0]++'` do? | Removes duplicate lines while preserving order. |
| Why does awk beat Python for a 10 GB log? | Single pass, line at a time, in constant memory. |
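The `!seen[$0]++` idiom from the flashcard looks like line noise until you trace it: the array lookup returns 0 (falsy) the first time a line appears, `!` flips that to true so the default print action fires, and `++` marks the line as seen. A quick demonstration:

```shell
# First sighting of each line prints; repeats are suppressed, order preserved
printf 'alpha\nbeta\nalpha\ngamma\nbeta\n' | awk '!seen[$0]++'
# alpha
# beta
# gamma
```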
Part 3: jq — SQL for JSON¶
Your second log source is structured JSON. The API gateway writes one JSON object per line:
{"timestamp":"2026-03-22T15:32:01Z","method":"GET","path":"/api/v2/products","status":200,"duration_ms":42,"client_ip":"10.0.5.200","user_agent":"python-requests/2.28.0"}
grep and awk can technically parse this. But they'll break the second a value contains a comma, a nested object, or a quoted string with spaces. jq understands JSON.
Name Origin: jq was created by Stephen Dolan at the University of Cambridge in 2012. The name follows the Unix tradition of short, lowercase tool names. Its official tagline: "jq is like sed for JSON data." It has its own Turing-complete functional language, but most people use about 5% of it.
Dot notation: the gateway drug¶
# Pretty-print JSON (the first thing everyone learns)
echo '{"name":"web","replicas":3}' | jq .
# Extract a field
echo '{"name":"web","replicas":3}' | jq '.name'
# "web"
# Nested access — dot your way down
echo '{"spec":{"replicas":3}}' | jq '.spec.replicas'
# 3
select: filtering arrays¶
Mental Model: Think of jq as SQL for JSON. `.items[]` is your `FROM` clause. `select()` is your `WHERE`. `{name: .metadata.name}` is your `SELECT`. Once you see it this way, complex queries write themselves.
# Find the error requests in our JSON log
cat api-gateway.log | jq 'select(.status >= 500)'
# Combine select with field extraction
cat api-gateway.log | jq 'select(.client_ip == "10.0.5.200") | {path, status, duration_ms}'
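Both filters are easy to try without the real log. Here a single echoed JSON line (values invented) stands in for `api-gateway.log`:

```shell
# One fabricated gateway record; select() passes it through because status >= 500
echo '{"timestamp":"2026-03-22T15:40:12Z","method":"GET","path":"/api/v2/products","status":503,"duration_ms":912,"client_ip":"10.0.5.200"}' \
| jq 'select(.status >= 500) | {path, status, duration_ms}'
# emits an object containing just path, status, and duration_ms
```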
The kubectl + jq power combo¶
This is where jq earns its place in your muscle memory. Kubernetes commands with -o json
produce deeply nested structures that are painful to read raw.
# List all non-Running pods — the single most useful k8s+jq command
kubectl get pods -o json | jq -r '
.items[]
| select(.status.phase != "Running")
| "\(.metadata.namespace)/\(.metadata.name) \(.status.phase)"
'
# Find pods with high restart counts
kubectl get pods -A -o json | jq -r '
.items[]
| select(.status.containerStatuses[]?.restartCount > 5)
| [.metadata.namespace, .metadata.name,
(.status.containerStatuses[0].restartCount | tostring)]
| @tsv
'
| jq concept | SQL equivalent | Example |
|---|---|---|
| `.items[]` | `FROM items` | Iterate the array |
| `select(.status == 200)` | `WHERE status = 200` | Filter |
| `{name, status}` | `SELECT name, status` | Pick columns |
| `group_by(.status)` | `GROUP BY status` | Aggregate |
| `sort_by(.duration)` | `ORDER BY duration` | Sort |
| `length` | `COUNT(*)` | Count |
| `map(.price) \| add` | `SUM(price)` | Sum |
map, reduce, and aggregation¶
Back to the traffic spike. Count errors by status code from JSON logs:
# Group and count by status code
cat api-gateway.log | jq -s '
group_by(.status)
| map({status: .[0].status, count: length})
| sort_by(-.count)
'
[
{"status": 200, "count": 89421},
{"status": 429, "count": 34201},
{"status": 500, "count": 1247}
]
That 429 count is telling — your rate limiter is firing. The 500s are the collateral damage.
Sum total request duration for the spike IP:
cat api-gateway.log | jq -s '
map(select(.client_ip == "10.0.5.200"))
| reduce .[] as $req (0; . + $req.duration_ms)
'
# 4829103 (total milliseconds consumed by this one client)
Remember: jq's reduce syntax is `reduce .[] as $item (INIT; UPDATE)`. INIT is the starting accumulator. UPDATE is applied per element. Same concept as Python's `functools.reduce` or JavaScript's `Array.reduce`.
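The same accumulator shape works on any array; a trivial sum makes the INIT and UPDATE roles visible:

```shell
# INIT is 0; UPDATE adds each element $n to the running accumulator "."
echo '[3, 4, 5]' | jq 'reduce .[] as $n (0; . + $n)'
# 12
```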
Output formats: @csv, @tsv, and raw strings¶
jq can produce output formats other than JSON — critical for feeding data into other tools.
# TSV output for spreadsheets or further awk processing
cat api-gateway.log | jq -r '
select(.status >= 400)
| [.timestamp, .client_ip, .path, .status, .duration_ms]
| @tsv
'
# CSV output (properly escaped — handles commas in values)
cat api-gateway.log | jq -r '
[.timestamp, .path, .status]
| @csv
'
Gotcha: Forgetting `-r` when piping jq output to other commands is the number one jq mistake. Without `-r`, strings include literal quotes: `"web-pod-abc"` instead of `web-pod-abc`. Those quotes break `xargs`, `for` loops, and every tool expecting clean input.
Slurp mode: when you need the whole picture¶
By default, jq processes each line independently. The -s (slurp) flag reads all lines
into a single array first — required for sorting, grouping, or counting across lines.
# Without -s: length of each individual JSON object
cat lines.jsonl | jq 'length' # prints 5, 7, 3, 6, ...
# With -s: total number of lines
cat lines.jsonl | jq -s 'length' # prints 4821
# Sort all entries by timestamp
cat lines.jsonl | jq -s 'sort_by(.timestamp)'
Under the Hood: jq compiles your filter into a bytecode VM that processes JSON in a single pass. For large files, this is dramatically faster than loading into Python. But `-s` (slurp) breaks this — it loads the entire input into memory. On a 2 GB JSONL file, slurp mode will consume 2+ GB of RAM. For large-file aggregation, consider streaming approaches or `awk` instead.
Flashcard Check: jq¶
| Question | Answer |
|---|---|
| What does `-r` do? | Strips JSON string quotes from output (raw output). |
| How do you provide a default for missing fields? | The alternative operator: `.name // "unknown"` |
| What's the difference between `.[]` and `map()`? | `.[]` produces a stream of bare values; `map()` wraps the results in an array. |
| How do you pass a shell variable into jq safely? | `--arg varname "$SHELL_VAR"` — never interpolate with double quotes. |
| What does `-s` (slurp) do? | Reads all input into a single array before filtering. |
| What does `-e` do? | Returns a non-zero exit code when the result is `null` or `false`. |
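Two of those answers are worth typing out once. The `//` default and `--arg` in action (field names invented for illustration):

```shell
# Default for a missing field: .name is null here, so // supplies "unknown"
echo '{}' | jq -r '.name // "unknown"'
# unknown

# Shell variable passed safely via --arg, no quote-interpolation games
TARGET=web
echo '{"name":"web"}' | jq -r --arg want "$TARGET" 'select(.name == $want) | .name'
# web
```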
Part 4: sed — The Stream Surgeon¶
You've identified the spike source. Now you need to clean and transform data for a report. This is sed's territory: surgical text transformations on a stream.
Name Origin: sed stands for stream editor. Written by Lee McMahon at Bell Labs in 1973–1974, it was designed as a non-interactive version of the `ed` line editor. The `s/old/new/` syntax comes directly from ed's substitute command — the same syntax later influenced Perl, vim, and every modern IDE's find-and-replace.
The substitute command: 90% of what you'll use¶
# Basic substitution (first occurrence per line)
sed 's/error/ERROR/' logfile.txt
# Global: all occurrences per line
sed 's/error/ERROR/g' logfile.txt
# Case-insensitive (GNU sed)
sed 's/error/ERROR/gI' logfile.txt
Practical scenario: sanitize IPs for a report¶
Your manager wants the log excerpt but with client IPs masked for the incident report:
# Mask IPv4 addresses in the log
sed -E 's/[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/xxx.xxx.xxx.xxx/g' access.log > sanitized.log
sed -i: in-place editing (with a portability trap)¶
# Edit a config file in place — Linux (GNU sed)
sed -i 's/DEBUG=true/DEBUG=false/' .env
# Same thing on macOS (BSD sed) — requires backup extension argument
sed -i '' 's/DEBUG=true/DEBUG=false/' .env
Gotcha: This macOS vs Linux difference has burned every engineer who writes CI scripts. GNU sed's `-i` takes an optional backup suffix. BSD sed's `-i` requires a mandatory argument (even if it's an empty string). The portable fix: `sed -i.bak 's/old/new/' file && rm file.bak` — or just use `perl -pi -e 's/old/new/' file`, which works identically everywhere.
Address ranges: operate on specific lines¶
# Delete comment lines
sed '/^#/d' nginx.conf
# Extract a time window from a log
sed -n '/2026-03-22 15:30/,/2026-03-22 16:00/p' access.log
# Replace only within a specific section
sed '/\[production\]/,/\[staging\]/s/timeout=30/timeout=60/' config.ini
The hold space: sed's second clipboard¶
sed has two buffers: the pattern space (current line being processed) and the hold space (a scratch pad that persists between lines).
| Command | What it does |
|---|---|
| `h` | Copy pattern space to hold space |
| `H` | Append pattern space to hold space |
| `g` | Copy hold space to pattern space |
| `G` | Append hold space to pattern space |
| `x` | Swap pattern space and hold space |
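The one hold-space trick that is genuinely harmless: `G` appends the hold space (empty by default) plus a newline after every line, double-spacing the output:

```shell
# Each input line is followed by a blank line: 2 lines in, 4 lines out
printf 'one\ntwo\n' | sed G | wc -l
# 4
```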
Honest advice: if you need more than basic hold-space operations, reach for awk. Hold-space gymnastics are clever but unmaintainable. The person reading your script at 3 AM (including future you) will not thank you.
Trivia: Eric Pement's "sed one-liners" collection, first published in the 1990s, became a canonical reference passed between Unix administrators for decades before Stack Overflow existed. Verifiable at `sed.sourceforge.net/sed1line.txt`.
Part 5: Same Problem, Three Ways¶
Here's where the tools' strengths become visceral. Same task: find the top 5 client IPs by request count.
Traditional access log (columnar text) → awk¶
awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' access.log \
| sort -rn | head -5
Why awk wins here: the data is columnar. $1 is the IP. No parsing needed.
JSON logs → jq¶
Here the aggregation happens entirely in jq (slower, but no external tools):
cat api-gateway.log | jq -s '
group_by(.client_ip)
| map({ip: .[0].client_ip, count: length})
| sort_by(-.count)
| .[:5]
'
Why jq wins here: the IP is a nested field. awk would need to parse JSON structure.
Quick extraction from messy text → sed + sort pipeline¶
Suppose your log has inconsistent formatting and you just need to yank out the IPs:
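One plausible shape for that extraction, sketched with GNU sed (it leans on `\b` word boundaries, a GNU extension) over fabricated messy lines; with a real file you'd feed `messy.log` to sed directly:

```shell
# Pull anything IPv4-shaped out of each line, then count and rank
printf '%s\n' \
  'WARN  spike? client=10.0.5.200 rc=429' \
  '[edge] 10.0.1.47 slow response' \
  'WARN  spike? client=10.0.5.200 rc=429' \
| sed -nE 's/.*\b([0-9]{1,3}(\.[0-9]{1,3}){3})\b.*/\1/p' \
| sort | uniq -c | sort -rn | head -5
# 10.0.5.200 tops the list with a count of 2
```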
Why sed works here: you don't need field splitting or JSON parsing. You need one regex extraction from messy, inconsistent text.
Mental Model: awk thinks in columns. jq thinks in trees. sed thinks in patterns. The data shape determines the tool.
Part 6: Real Pipelines — Combining Tools¶
The real power emerges when you chain tools together. Each handles the stage it's best at.
Pipeline 1: JSON logs → jq extracts → awk aggregates¶
# Average response time per endpoint from JSON logs
cat api-gateway.log \
| jq -r '[.path, .duration_ms] | @tsv' \
| awk -F'\t' '{ sum[$1] += $2; count[$1]++ }
END { for (p in sum) printf "%s\t%.1f ms\t(%d reqs)\n", p, sum[p]/count[p], count[p] }' \
| sort -t$'\t' -k2 -rn \
| head -10
Why this works: jq handles the JSON extraction and outputs clean TSV. awk handles the math (averages, counts). sort handles the ordering. Each tool does what it's best at.
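The whole pipeline shrinks to a verifiable toy if you substitute three fabricated JSON lines for the log:

```shell
# Three invented gateway records in place of api-gateway.log
printf '%s\n' \
  '{"path":"/api/v2/products","duration_ms":10}' \
  '{"path":"/api/v2/products","duration_ms":30}' \
  '{"path":"/health","duration_ms":4}' \
| jq -r '[.path, .duration_ms] | @tsv' \
| awk -F'\t' '{ sum[$1] += $2; count[$1]++ }
      END { for (p in sum) printf "%s\t%.1f ms\t(%d reqs)\n", p, sum[p]/count[p], count[p] }' \
| sort
# /api/v2/products averages 20.0 ms over 2 reqs; /health 4.0 ms over 1
```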
Pipeline 2: kubectl → jq → sed → final output¶
# Get pod resource requests, formatted for a report
kubectl get pods -o json \
| jq -r '.items[] | [.metadata.name, .spec.containers[0].resources.requests.cpu // "none",
.spec.containers[0].resources.requests.memory // "none"] | @tsv' \
| sed 's/none/⚠ NOT SET/g' \
| column -t
Pipeline 3: Process monitoring data with awk¶
# From Prometheus text format: find metrics above threshold
curl -s http://localhost:9090/metrics \
| awk '/^http_requests_total/ && !/^#/ {
split($0, parts, " ")
value = parts[2]
if (value+0 > 10000) print $0
}'
Pipeline 4: Cross-referencing two data sources¶
The spike IPs from the access log — are they hitting pods that are unhealthy?
# Step 1: Get the suspicious IPs
SPIKE_IPS=$(awk '{ count[$1]++ } END {
for (ip in count) if (count[ip] > 100000) print ip
}' access.log)
# Step 2: Check which pods those IPs are hitting (from JSON logs)
for ip in $SPIKE_IPS; do
echo "=== $ip ==="
cat api-gateway.log \
| jq -r --arg ip "$ip" 'select(.client_ip == $ip) | .path' \
| sort | uniq -c | sort -rn | head -5
done
# Step 3: Check pod health for those endpoints
kubectl get pods -o json | jq -r '
.items[]
| select(.status.containerStatuses[]?.restartCount > 0)
| [.metadata.name, .status.phase,
(.status.containerStatuses[0].restartCount | tostring)]
| @tsv
' | column -t
Part 7: Performance — When awk Beats Python (and When It Doesn't)¶
The benchmark everyone should see¶
Processing a 10 GB access log to count requests per IP:
| Approach | Time | Memory |
|---|---|---|
| `awk '{c[$1]++} END {for(i in c) print c[i],i}'` | ~3 min | ~800 KB |
| `sort \| uniq -c` | ~8 min | ~200 KB (disk-backed sort) |
| Python with `readlines()` | ~45 min | ~12 GB |
| Python with line-by-line read | ~15 min | ~50 MB |
War Story: A junior engineer wrote a Python script to parse a 50 GB access log and count requests per IP. It took 45 minutes and consumed 12 GB of RAM before finishing. The equivalent awk one-liner ran in 3 minutes using 800 KB. The lesson: awk (and mawk) are purpose-built for this shape of work. Save Python for when you need libraries, complex data structures, or error handling beyond what awk provides.
When Python wins¶
- Complex JSON transformations with error handling and retries
- Data that needs to be joined across multiple files with different schemas
- Anything involving HTTP requests, databases, or third-party APIs
- When the "script" grows beyond ~10 lines of awk
mawk vs gawk: it matters at scale¶
| Implementation | Default on | Speed | Features |
|---|---|---|---|
| gawk (GNU awk) | Red Hat, Fedora, Arch | Baseline | Full: networking, time functions, match() captures |
| mawk (Mike's awk) | Debian, Ubuntu | 2–10x faster | Minimal: no match() captures, no networking |
Trivia: mawk's speed advantage comes from using a bytecode interpreter instead of gawk's tree-walking interpreter. For small files, the difference is negligible. For multi-gigabyte log processing, mawk can be 10x faster. Michael Brennan wrote mawk in the early 1990s, and it remains the default awk on Debian-derived distributions.
Part 8: Processing CSV/TSV Data¶
Real operational data comes in CSV and TSV constantly — cloud billing exports, inventory lists, monitoring data exports.
awk for CSV (with caveats)¶
# Sum the third column of a CSV (skip header)
awk -F, 'NR > 1 { sum += $3 } END { printf "Total: $%.2f\n", sum }' invoices.csv
# Filter rows where status column contains "failed"
awk -F, '$4 ~ /failed/' jobs.csv
# Reformat: swap columns, change delimiter
awk -F, 'BEGIN { OFS="\t" } NR > 1 { print $3, $1, $2 }' data.csv > data.tsv
Gotcha: awk's CSV parsing is naive — it splits on every comma, including commas inside quoted fields. `"Smith, John",42,admin` breaks into four fields instead of three. For proper CSV with quoted fields, use `gawk` with `FPAT` or switch to Python's `csv` module. For quick-and-dirty ops work where you control the data, awk is fine.
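You can watch the naive split go wrong in one line:

```shell
# The row has 3 logical columns, but the comma inside the quotes also splits
echo '"Smith, John",42,admin' | awk -F, '{ print NF }'
# 4
```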
jq for CSV/TSV output¶
# Convert JSON array to CSV
echo '[{"name":"web","cpu":"500m"},{"name":"api","cpu":"1000m"}]' \
| jq -r '.[] | [.name, .cpu] | @csv'
# "web","500m"
# "api","1000m"
# TSV (no quoting, tab-separated)
echo '[{"name":"web","cpu":"500m"}]' \
| jq -r '.[] | [.name, .cpu] | @tsv'
# web 500m
sed for CSV cleanup¶
# Remove Windows carriage returns from a CSV
sed 's/\r$//' windows_export.csv > clean.csv
# Strip surrounding quotes from every field
sed 's/"//g' quoted.csv
# Replace delimiter: semicolons to commas (European CSV exports)
sed 's/;/,/g' european.csv > standard.csv
Part 9: The Decision Tree¶
You're staring at data. Which tool do you reach for?
Is the data JSON?
├── Yes → jq
│ ├── Simple field extraction → jq '.field'
│ ├── Filtering → jq 'select()'
│ ├── Aggregation → jq -s 'group_by() | map()'
│ └── Need to feed into other tools → jq -r '... | @tsv' | awk/sort/...
│
├── No → Is it columnar (fixed fields per line)?
│ ├── Yes → awk
│ │ ├── Extract fields → awk '{print $2, $5}'
│ │ ├── Filter + extract → awk '/pattern/ {print $1}'
│ │ ├── Count/sum/average → awk '{sum+=$3} END {print sum}'
│ │ └── Group by key → awk '{c[$1]++} END {for(k in c) print k,c[k]}'
│ │
│ └── No → Is it a quick text transformation?
│ ├── Yes → sed
│ │ ├── Find-and-replace → sed 's/old/new/g'
│ │ ├── Delete lines → sed '/pattern/d'
│ │ ├── Extract range → sed -n '/start/,/end/p'
│ │ └── In-place config edit → sed -i 's/old/new/' file
│ │
│ └── Complex/multi-format → Write a Python script
Interview Bridge: "When would you use sed vs awk vs jq?" is a common DevOps interview question. The clean answer: jq for JSON, awk for columnar data and aggregation, sed for simple substitutions and line surgery. When the task grows beyond ~10 lines, switch to Python. Showing you know when to use each tool matters more than memorizing syntax.
Exercises¶
Exercise 1: Quick Win (2 minutes)¶
Given this JSON, extract just the names of pods that are not "Running":
{"items":[{"metadata":{"name":"web-abc"},"status":{"phase":"Running"}},{"metadata":{"name":"api-def"},"status":{"phase":"CrashLoopBackOff"}},{"metadata":{"name":"db-ghi"},"status":{"phase":"Running"}}]}
Solution
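One way to do it, run against the exercise's JSON inline (`-r` keeps the names shell-clean):

```shell
echo '{"items":[{"metadata":{"name":"web-abc"},"status":{"phase":"Running"}},{"metadata":{"name":"api-def"},"status":{"phase":"CrashLoopBackOff"}},{"metadata":{"name":"db-ghi"},"status":{"phase":"Running"}}]}' \
| jq -r '.items[] | select(.status.phase != "Running") | .metadata.name'
# api-def
```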
Exercise 2: awk Aggregation (5 minutes)¶
You have an access log line format: IP - - [timestamp] "METHOD /path HTTP/1.1" STATUS BYTES
Write an awk one-liner that prints the total bytes served per unique IP address, sorted by bytes descending. Field 10 is the byte count.
Hint: Use an associative array keyed on `$1` (the IP), accumulating `$10`.

Solution
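A possible solution, demonstrated here on three fabricated log lines (with the real file you would run the awk one-liner against `access.log`):

```shell
printf '%s\n' \
  '10.0.5.200 - - [22/Mar/2026:15:32:01 +0000] "GET /api/v2/products HTTP/1.1" 200 1000' \
  '10.0.5.200 - - [22/Mar/2026:15:32:02 +0000] "GET /api/v2/products HTTP/1.1" 200 500' \
  '10.0.1.47 - - [22/Mar/2026:15:32:03 +0000] "GET /health HTTP/1.1" 200 100' \
| awk '{ bytes[$1] += $10 } END { for (ip in bytes) print bytes[ip], ip }' \
| sort -rn
# 1500 10.0.5.200
# 100 10.0.1.47
```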
Exercise 3: Pipeline Composition (10 minutes)¶
Your JSON API logs contain {"timestamp":"...","status":200,"path":"/api/v2/products","duration_ms":42}.
Build a pipeline that:
1. Extracts only requests to paths starting with /api/v2/
2. Groups by path
3. Shows average duration and request count per path
4. Outputs as a formatted table
Hint: Use jq to filter and extract TSV, then awk for aggregation, then `column -t` for formatting.

Solution
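One composition that satisfies all four requirements, shown against fabricated lines (with the real file, start from `cat api-gateway.log` instead of `printf`):

```shell
printf '%s\n' \
  '{"timestamp":"t1","status":200,"path":"/api/v2/products","duration_ms":40}' \
  '{"timestamp":"t2","status":200,"path":"/api/v2/products","duration_ms":60}' \
  '{"timestamp":"t3","status":200,"path":"/healthz","duration_ms":2}' \
| jq -r 'select(.path | startswith("/api/v2/")) | [.path, .duration_ms] | @tsv' \
| awk -F'\t' '{ sum[$1] += $2; n[$1]++ }
      END { for (p in sum) printf "%s\t%.1f\t%d\n", p, sum[p]/n[p], n[p] }' \
| column -t
# one row per /api/v2/ path: average duration 50.0, request count 2
```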
Exercise 4: sed Surgery (5 minutes)¶
You have a config file with lines like # ENABLE_FEATURE_X=true (commented out).
Write a sed command that uncomments only the line containing FEATURE_X without
affecting any other commented lines.
Solution
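One answer: anchor the substitution to lines matching FEATURE_X, then strip the comment prefix. Demonstrated on two fabricated config lines:

```shell
# Only the FEATURE_X line loses its "# " prefix; the other comment survives
printf '%s\n' '# ENABLE_FEATURE_X=true' '# SOME_OTHER_FLAG=false' \
| sed '/FEATURE_X/s/^# *//'
# ENABLE_FEATURE_X=true
# # SOME_OTHER_FLAG=false
```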
Cheat Sheet¶
jq¶
| Pattern | Command |
|---|---|
| Pretty-print | jq . |
| Extract field | jq '.metadata.name' |
| Raw output (no quotes) | jq -r '.name' |
| Filter array | jq '.[] \| select(.status != "Running")' |
| Transform all elements | jq 'map(.name)' |
| Construct new object | jq '{name: .metadata.name, phase: .status.phase}' |
| Default for missing field | jq '.timeout // 30' |
| Count items | jq -s 'length' |
| Group and count | jq -s 'group_by(.key) \| map({key: .[0].key, n: length})' |
| CSV output | jq -r '[.f1, .f2] \| @csv' |
| TSV output | jq -r '[.f1, .f2] \| @tsv' |
| Pass shell variable | jq --arg v "$VAR" 'select(.name == $v)' |
awk¶
| Pattern | Command |
|---|---|
| Print a field | awk '{print $2}' |
| Set delimiter | awk -F: '{print $1}' |
| Filter by pattern | awk '/ERROR/ {print}' |
| Filter by field value | awk '$3 > 100' |
| Count by key | awk '{c[$1]++} END {for(k in c) print k,c[k]}' |
| Sum a column | awk '{s+=$3} END {print s}' |
| Average | awk '{s+=$3;n++} END {print s/n}' |
| Skip header | awk 'NR > 1' |
| Deduplicate | awk '!seen[$0]++' |
| Printf formatting | awk '{printf "%-20s %5d\n", $1, $2}' |
sed¶
| Pattern | Command |
|---|---|
| Substitute (first per line) | sed 's/old/new/' |
| Substitute (all per line) | sed 's/old/new/g' |
| In-place edit (portable) | sed -i.bak 's/old/new/g' file && rm file.bak |
| Delete matching lines | sed '/pattern/d' |
| Print only matches | sed -n '/pattern/p' |
| Extract line range | sed -n '10,20p' |
| Extract pattern range | sed -n '/START/,/END/p' |
| Use different delimiter | sed 's\|/old/path\|/new/path\|g' |
| Extended regex | sed -E 's/(group)/\1/' |
| Delete blank lines | sed '/^$/d' |
Takeaways¶
- Match tool to data shape. jq for JSON, awk for columns, sed for text surgery. Forcing the wrong tool wastes time and produces fragile commands.
- awk's mental model in one sentence: "For each line, if pattern matches, do action." Associative arrays make it a one-pass aggregation engine.
- jq is SQL for JSON. `.items[]` = FROM, `select()` = WHERE, `{name, status}` = SELECT. Learn these three and you handle 90% of kubectl/API work.
- sed is a stream surgeon, not a programmer. Substitution and line-level operations are its strength. The hold space exists, but if you need it, you probably need awk instead.
- Pipelines beat monoliths. jq extracts → awk aggregates → sort orders → head limits. Each tool handles one stage. This composability is the Unix philosophy in action.
- awk processes gigabyte files in constant memory. Single-pass, line-at-a-time processing means awk handles files that would crash a naive Python script. Know when to reach for it.
Related Lessons¶
- strace: Reading the Matrix — when the problem is below your application and you need to see the syscall conversation
- What Happens When You Type a Regex — understand NFA vs DFA backtracking and why some regex patterns are dangerous
- What Happens Inside a Linux Pipe — the kernel mechanics behind every `|` in your pipelines
- Why Everything Uses JSON Now — the history and tradeoffs of JSON as the universal data format