Text Processing: jq, awk, and sed in the Trenches
- lesson
- jq
- awk
- sed
- log-analysis
- json-processing
- pipelines
- kubectl
- monitoring-data

# Text Processing — jq, awk, and sed in the Trenches

Topics: jq, awk, sed, log analysis, JSON processing, pipelines, kubectl, monitoring data
Level: L1–L2 (Foundations → Operations)
Time: 60–80 minutes
Strategy: Parallel + build-up
The Mission¶
It's 4:17 PM on a Friday. Monitoring shows a traffic spike — 10x normal request volume started 45 minutes ago, and your API latency is climbing. Your manager asks the question nobody wants to hear before the weekend: "Can you figure out where the traffic is coming from before we decide whether to page the on-call team?"
You have a 2 GB access log on the production host. Some of your services log JSON, some log the traditional Apache Combined format. The Kubernetes cluster is serving multiple microservices and you need to cross-reference pod health with the log data.
Your tools: jq, awk, sed, and a terminal. No Splunk, no Datadog — the observability
stack is lagging behind the spike. It's just you and the command line.
Part 1: The Right Tool for the Shape of the Data¶
Before you touch the keyboard, the first question is: what format is this data in?
| Data shape | Reach for | Why |
|---|---|---|
| JSON (APIs, kubectl, structured logs) | jq | Understands nesting, arrays, types |
| Columnar text (access logs, TSV, ps output) | awk | Built-in field splitting, math, aggregation |
| Quick find-and-replace, line surgery | sed | Stream substitution, address ranges |
| Mix of formats in a pipeline | All three | Each handles one stage |
Mental Model: Think of it as three specialists in an ER. sed is the triage nurse — fast assessment, quick fixes, move on. awk is the radiologist — sees structure in everything, counts and measures. jq is the specialist — deep understanding of one complex format (JSON), and nobody else can do what it does.
Part 2: awk — "For Each Line, If Pattern Matches, Do Action"¶
Your first log is a traditional nginx access log. Two gigabytes of lines that look like this:
10.0.1.47 - - [22/Mar/2026:15:32:01 +0000] "GET /api/v2/products HTTP/1.1" 200 1432 "-" "python-requests/2.28.0"
Name Origin: awk is named after its three creators at Bell Labs: Alfred Aho, Peter Weinberger, and Brian Kernighan. Created in 1977. Aho was a formal languages expert (he also created egrep), Weinberger worked on databases, and Kernighan co-authored The C Programming Language. The name is literally their initials.
The mental model¶
Every awk program boils down to one sentence:
"For each line, if the pattern matches, do the action."
That's it. awk reads one line, checks the pattern, runs the action, moves to the next line.
No pattern means "every line." No action means "print the line." Let's use this to find
where your traffic spike is coming from.
Find the top talkers¶
# Count requests per IP address
awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' access.log \
| sort -rn | head -20
Break that down piece by piece:
| Fragment | What it does |
|---|---|
| `count[$1]++` | Uses an associative array keyed by field 1 (the IP). Increments on each hit. |
| `END { ... }` | Runs once after all input is processed. |
| `for (ip in count)` | Iterates every key in the array. |
| `print count[ip], ip` | Outputs the count and the IP. |
| `sort -rn \| head -20` | Sorts numerically, descending. Top 20. |
You run it. The sorted counts come back in under a minute: two IPs account for over a million requests. That's your spike.
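The counting pattern is easy to sanity-check in miniature. Below, three fabricated log lines (invented IPs and paths) stand in for the 2 GB file:

```shell
# Tiny stand-in for access.log: two hits from one IP, one from another
printf '%s\n' \
  '10.0.5.200 - - [22/Mar/2026:15:32:01 +0000] "GET /api/v2/products HTTP/1.1" 200 1432' \
  '10.0.5.200 - - [22/Mar/2026:15:32:02 +0000] "GET /api/v2/products HTTP/1.1" 200 1432' \
  '10.0.1.47 - - [22/Mar/2026:15:32:03 +0000] "GET /health HTTP/1.1" 200 12' \
| awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' \
| sort -rn
# 2 10.0.5.200
# 1 10.0.1.47
```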
Under the Hood: awk processes input in a single pass, reading one record at a time. It never holds the entire file in memory (unless you store every line in an array). This is why that command just chewed through 2 GB in under a minute using 800 KB of RAM. A Python script doing `readlines()` on the same file would consume gigabytes.
Narrow it down: what are they hitting?¶
# Requests per endpoint for the suspicious IPs
awk '$1 == "10.0.5.200" { split($7, path, "?"); endpoints[path[1]]++ }
END { for (e in endpoints) print endpoints[e], e }' access.log \
| sort -rn | head -10
They're hammering /api/v2/products. Now you know the what and the who.
Time-based analysis: when did it start?¶
# Requests per minute from the suspicious IP
awk '$1 == "10.0.5.200" {
# Extract timestamp: [22/Mar/2026:15:32:01
match($4, /\[(.+):(..):(..):/, t)
minute = t[1] ":" t[2] ":" t[3]
buckets[minute]++
} END {
for (m in buckets) print m, buckets[m]
}' access.log | sort | tail -20
Gotcha: awk's `match()` with capture groups is a GNU awk (gawk) extension. On systems with mawk (Debian/Ubuntu default), you'd need `substr()` and `index()` instead, or switch to `gawk` explicitly. Check which you have: `awk --version 2>/dev/null || awk -V`.
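A mawk-portable variant of the per-minute bucketing, sketched here against one fabricated log line, sidesteps `match()` captures entirely by exploiting the timestamp's fixed width:

```shell
# Portable rewrite: $4 always looks like [22/Mar/2026:15:32:01,
# so characters 2-18 are the date down to the minute.
printf '%s\n' \
  '10.0.5.200 - - [22/Mar/2026:15:32:01 +0000] "GET /api/v2/products HTTP/1.1" 200 1432' \
| awk '$1 == "10.0.5.200" {
    minute = substr($4, 2, 17)   # skip the "[", take dd/Mon/yyyy:HH:MM
    buckets[minute]++
  } END {
    for (m in buckets) print m, buckets[m]
  }'
# 22/Mar/2026:15:32 1
```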
awk's built-in variables — the cheat code¶
| Variable | Meaning | Use case |
|---|---|---|
| `$0` | Entire line | Print the whole thing |
| `$1`, `$2`, ... `$NF` | Fields (1-indexed) | Extract specific columns |
| `NR` | Total records read so far | Line numbers, skip headers |
| `NF` | Number of fields in current line | Get last field: `$NF` |
| `FS` | Input field separator | `-F:` or `BEGIN { FS=":" }` |
| `OFS` | Output field separator | `BEGIN { OFS="\t" }` for TSV |
# Skip the header line of a CSV
awk -F, 'NR > 1 { print $2 }' data.csv
# Print line numbers alongside content
awk '{ print NR, $0 }' file.txt
# Print the last field of every line (useful when you don't know the column count)
awk '{ print $NF }' mystery.log
Trivia: awk was created in 1977 — it predates Perl (1987), Python (1991), and Ruby (1995). Despite being nearly 50 years old, it ships on every Unix system and its core syntax is POSIX-standardized. The skill never expires.
Flashcard Check: awk¶
Cover the right column. Test yourself.
| Question | Answer |
|---|---|
| What does `$NF` give you? | The last field on the current line. |
| How do you set a colon delimiter? | `awk -F:` or `BEGIN { FS=":" }` |
| What runs before any input is read? | The `BEGIN` block. |
| What does `awk '!seen[$0]++'` do? | Removes duplicate lines while preserving order. |
| Why does awk beat Python for a 10 GB log? | Single pass, line at a time, in constant memory. |
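The `!seen[$0]++` idiom from the flashcard looks like line noise until you trace it: the array lookup returns 0 (falsy) the first time a line appears, `!` flips that to true so the default print action fires, and `++` marks the line as seen. A quick demonstration:

```shell
# First sighting of each line prints; repeats are suppressed, order preserved
printf 'alpha\nbeta\nalpha\ngamma\nbeta\n' | awk '!seen[$0]++'
# alpha
# beta
# gamma
```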
Part 3: jq — SQL for JSON¶
Your second log source is structured JSON. The API gateway writes one JSON object per line:
{"timestamp":"2026-03-22T15:32:01Z","method":"GET","path":"/api/v2/products","status":200,"duration_ms":42,"client_ip":"10.0.5.200","user_agent":"python-requests/2.28.0"}
grep and awk can technically parse this. But they'll break the second a value contains a comma, a nested object, or a quoted string with spaces. jq understands JSON.
Name Origin: jq was created by Stephen Dolan at the University of Cambridge in 2012. The name follows the Unix tradition of short, lowercase tool names. Its official tagline: "jq is like sed for JSON data." It has its own Turing-complete functional language, but most people use about 5% of it.
Dot notation: the gateway drug¶
# Pretty-print JSON (the first thing everyone learns)
echo '{"name":"web","replicas":3}' | jq .
# Extract a field
echo '{"name":"web","replicas":3}' | jq '.name'
# "web"
# Nested access — dot your way down
echo '{"spec":{"replicas":3}}' | jq '.spec.replicas'
# 3
select: filtering arrays¶
Mental Model: Think of jq as SQL for JSON. `.items[]` is your `FROM` clause. `select()` is your `WHERE`. `{name: .metadata.name}` is your `SELECT`. Once you see it this way, complex queries write themselves.
# Find the error requests in our JSON log
cat api-gateway.log | jq 'select(.status >= 500)'
# Combine select with field extraction
cat api-gateway.log | jq 'select(.client_ip == "10.0.5.200") | {path, status, duration_ms}'
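Both filters are easy to try without the real log. Here a single echoed JSON line (values invented) stands in for `api-gateway.log`:

```shell
# One fabricated gateway record; select() passes it through because status >= 500
echo '{"timestamp":"2026-03-22T15:40:12Z","method":"GET","path":"/api/v2/products","status":503,"duration_ms":912,"client_ip":"10.0.5.200"}' \
| jq 'select(.status >= 500) | {path, status, duration_ms}'
# emits an object containing just path, status, and duration_ms
```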
The kubectl + jq power combo¶
This is where jq earns its place in your muscle memory. Kubernetes commands with -o json
produce deeply nested structures that are painful to read raw.
# List all non-Running pods — the single most useful k8s+jq command
kubectl get pods -o json | jq -r '
.items[]
| select(.status.phase != "Running")
| "\(.metadata.namespace)/\(.metadata.name) \(.status.phase)"
'
# Find pods with high restart counts
kubectl get pods -A -o json | jq -r '
.items[]
| select(.status.containerStatuses[]?.restartCount > 5)
| [.metadata.namespace, .metadata.name,
(.status.containerStatuses[0].restartCount | tostring)]
| @tsv
'
| jq concept | SQL equivalent | Example |
|---|---|---|
| `.items[]` | `FROM items` | Iterate the array |
| `select(.status == 200)` | `WHERE status = 200` | Filter |
| `{name, status}` | `SELECT name, status` | Pick columns |
| `group_by(.status)` | `GROUP BY status` | Aggregate |
| `sort_by(.duration)` | `ORDER BY duration` | Sort |
| `length` | `COUNT(*)` | Count |
| `map(.price) \| add` | `SUM(price)` | Sum |
map, reduce, and aggregation¶
Back to the traffic spike. Count errors by status code from JSON logs:
# Group and count by status code
cat api-gateway.log | jq -s '
group_by(.status)
| map({status: .[0].status, count: length})
| sort_by(-.count)
'
[
{"status": 200, "count": 89421},
{"status": 429, "count": 34201},
{"status": 500, "count": 1247}
]
That 429 count is telling — your rate limiter is firing. The 500s are the collateral damage.
Sum total request duration for the spike IP:
cat api-gateway.log | jq -s '
map(select(.client_ip == "10.0.5.200"))
| reduce .[] as $req (0; . + $req.duration_ms)
'
# 4829103 (total milliseconds consumed by this one client)
Remember: jq's reduce syntax is `reduce .[] as $item (INIT; UPDATE)`. INIT is the starting accumulator. UPDATE is applied per element. Same concept as Python's `functools.reduce` or JavaScript's `Array.reduce`.
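The same accumulator shape works on any array; a trivial sum makes the INIT and UPDATE roles visible:

```shell
# INIT is 0; UPDATE adds each element $n to the running accumulator "."
echo '[3, 4, 5]' | jq 'reduce .[] as $n (0; . + $n)'
# 12
```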
Output formats: @csv, @tsv, and raw strings¶
jq can produce output formats other than JSON — critical for feeding data into other tools.
# TSV output for spreadsheets or further awk processing
cat api-gateway.log | jq -r '
select(.status >= 400)
| [.timestamp, .client_ip, .path, .status, .duration_ms]
| @tsv
'
# CSV output (properly escaped — handles commas in values)
cat api-gateway.log | jq -r '
[.timestamp, .path, .status]
| @csv
'
Gotcha: Forgetting `-r` when piping jq output to other commands is the number one jq mistake. Without `-r`, strings include literal quotes: `"web-pod-abc"` instead of `web-pod-abc`. Those quotes break `xargs`, `for` loops, and every tool expecting clean input.
Slurp mode: when you need the whole picture¶
By default, jq processes each line independently. The -s (slurp) flag reads all lines
into a single array first — required for sorting, grouping, or counting across lines.
# Without -s: length of each individual JSON object
cat lines.jsonl | jq 'length' # prints 5, 7, 3, 6, ...
# With -s: total number of lines
cat lines.jsonl | jq -s 'length' # prints 4821
# Sort all entries by timestamp
cat lines.jsonl | jq -s 'sort_by(.timestamp)'
Under the Hood: jq compiles your filter into a bytecode VM that processes JSON in a single pass. For large files, this is dramatically faster than loading into Python. But `-s` (slurp) breaks this — it loads the entire input into memory. On a 2 GB JSONL file, slurp mode will consume 2+ GB of RAM. For large-file aggregation, consider streaming approaches or `awk` instead.
Flashcard Check: jq¶
| Question | Answer |
|---|---|
| What does `-r` do? | Strips JSON string quotes from output (raw output). |
| How do you provide a default for missing fields? | The alternative operator: `.name // "unknown"` |
| What's the difference between `.[]` and `map()`? | `.[]` produces a stream of bare values; `map()` wraps the results in an array. |
| How do you pass a shell variable into jq safely? | `--arg varname "$SHELL_VAR"` — never interpolate with double quotes. |
| What does `-s` (slurp) do? | Reads all input into a single array before filtering. |
| What does `-e` do? | Returns a non-zero exit code when the result is `null` or `false`. |
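Two of those answers are worth typing out once. The `//` default and `--arg` in action (field names invented for illustration):

```shell
# Default for a missing field: .name is null here, so // supplies "unknown"
echo '{}' | jq -r '.name // "unknown"'
# unknown

# Shell variable passed safely via --arg, no quote-interpolation games
TARGET=web
echo '{"name":"web"}' | jq -r --arg want "$TARGET" 'select(.name == $want) | .name'
# web
```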
Part 4: sed — The Stream Surgeon¶
You've identified the spike source. Now you need to clean and transform data for a report. This is sed's territory: surgical text transformations on a stream.
Name Origin: sed stands for stream editor. Written by Lee McMahon at Bell Labs in 1973–1974, it was designed as a non-interactive version of the `ed` line editor. The `s/old/new/` syntax comes directly from ed's substitute command — the same syntax later influenced Perl, vim, and every modern IDE's find-and-replace.
The substitute command: 90% of what you'll use¶
# Basic substitution (first occurrence per line)
sed 's/error/ERROR/' logfile.txt
# Global: all occurrences per line
sed 's/error/ERROR/g' logfile.txt
# Case-insensitive (GNU sed)
sed 's/error/ERROR/gI' logfile.txt
Practical scenario: sanitize IPs for a report¶
Your manager wants the log excerpt but with client IPs masked for the incident report:
# Mask IPv4 addresses in the log
sed -E 's/[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/xxx.xxx.xxx.xxx/g' access.log > sanitized.log
sed -i: in-place editing (with a portability trap)¶
# Edit a config file in place — Linux (GNU sed)
sed -i 's/DEBUG=true/DEBUG=false/' .env
# Same thing on macOS (BSD sed) — requires backup extension argument
sed -i '' 's/DEBUG=true/DEBUG=false/' .env
Gotcha: This macOS vs Linux difference has burned every engineer who writes CI scripts. GNU sed's `-i` takes an optional backup suffix. BSD sed's `-i` requires a mandatory argument (even if it's an empty string). The portable fix: `sed -i.bak 's/old/new/' file && rm file.bak` — or just use `perl -pi -e 's/old/new/' file`, which works identically everywhere.
Address ranges: operate on specific lines¶
# Delete comment lines
sed '/^#/d' nginx.conf
# Extract a time window from a log
sed -n '/2026-03-22 15:30/,/2026-03-22 16:00/p' access.log
# Replace only within a specific section
sed '/\[production\]/,/\[staging\]/s/timeout=30/timeout=60/' config.ini
The hold space: sed's second clipboard¶
sed has two buffers: the pattern space (current line being processed) and the hold space (a scratch pad that persists between lines).
| Command | What it does |
|---|---|
| `h` | Copy pattern space to hold space |
| `H` | Append pattern space to hold space |
| `g` | Copy hold space to pattern space |
| `G` | Append hold space to pattern space |
| `x` | Swap pattern space and hold space |
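The one hold-space trick that is genuinely harmless: `G` appends the hold space (empty by default) plus a newline after every line, double-spacing the output:

```shell
# Each input line is followed by a blank line: 2 lines in, 4 lines out
printf 'one\ntwo\n' | sed G | wc -l
# 4
```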
Honest advice: if you need more than basic hold-space operations, reach for awk. Hold-space gymnastics are clever but unmaintainable. The person reading your script at 3 AM (including future you) will not thank you.
Trivia: Eric Pement's "sed one-liners" collection, first published in the 1990s, became a canonical reference passed between Unix administrators for decades before Stack Overflow existed. Verifiable at `sed.sourceforge.net/sed1line.txt`.
Part 5: Same Problem, Three Ways¶
Here's where the tools' strengths become visceral. Same task: find the top 5 client IPs by request count.
Traditional access log (columnar text) → awk¶
awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' access.log \
| sort -rn | head -5
Why awk wins here: the data is columnar. $1 is the IP. No parsing needed.
JSON logs → jq¶
Here the aggregation happens entirely in jq (slower, but no external tools):
cat api-gateway.log | jq -s '
group_by(.client_ip)
| map({ip: .[0].client_ip, count: length})
| sort_by(-.count)
| .[:5]
'
Why jq wins here: the IP is a nested field. awk would need to parse JSON structure.
Quick extraction from messy text → sed + sort pipeline¶
Suppose your log has inconsistent formatting and you just need to yank out the IPs:
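One plausible shape for that extraction, sketched with GNU sed (it leans on `\b` word boundaries, a GNU extension) over fabricated messy lines; with a real file you'd feed `messy.log` to sed directly:

```shell
# Pull anything IPv4-shaped out of each line, then count and rank
printf '%s\n' \
  'WARN  spike? client=10.0.5.200 rc=429' \
  '[edge] 10.0.1.47 slow response' \
  'WARN  spike? client=10.0.5.200 rc=429' \
| sed -nE 's/.*\b([0-9]{1,3}(\.[0-9]{1,3}){3})\b.*/\1/p' \
| sort | uniq -c | sort -rn | head -5
# 10.0.5.200 tops the list with a count of 2
```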
Why sed works here: you don't need field splitting or JSON parsing. You need one regex extraction from messy, inconsistent text.
Mental Model: awk thinks in columns. jq thinks in trees. sed thinks in patterns. The data shape determines the tool.
Part 6: Real Pipelines — Combining Tools¶
The real power emerges when you chain tools together. Each handles the stage it's best at.
Pipeline 1: JSON logs → jq extracts → awk aggregates¶
# Average response time per endpoint from JSON logs
cat api-gateway.log \
| jq -r '[.path, .duration_ms] | @tsv' \
| awk -F'\t' '{ sum[$1] += $2; count[$1]++ }
END { for (p in sum) printf "%s\t%.1f ms\t(%d reqs)\n", p, sum[p]/count[p], count[p] }' \
| sort -t$'\t' -k2 -rn \
| head -10
Why this works: jq handles the JSON extraction and outputs clean TSV. awk handles the math (averages, counts). sort handles the ordering. Each tool does what it's best at.
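The whole pipeline shrinks to a verifiable toy if you substitute three fabricated JSON lines for the log:

```shell
# Three invented gateway records in place of api-gateway.log
printf '%s\n' \
  '{"path":"/api/v2/products","duration_ms":10}' \
  '{"path":"/api/v2/products","duration_ms":30}' \
  '{"path":"/health","duration_ms":4}' \
| jq -r '[.path, .duration_ms] | @tsv' \
| awk -F'\t' '{ sum[$1] += $2; count[$1]++ }
      END { for (p in sum) printf "%s\t%.1f ms\t(%d reqs)\n", p, sum[p]/count[p], count[p] }' \
| sort
# /api/v2/products averages 20.0 ms over 2 reqs; /health 4.0 ms over 1
```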
Pipeline 2: kubectl → jq → sed → final output¶
# Get pod resource requests, formatted for a report
kubectl get pods -o json \
| jq -r '.items[] | [.metadata.name, .spec.containers[0].resources.requests.cpu // "none",
.spec.containers[0].resources.requests.memory // "none"] | @tsv' \
| sed 's/none/⚠ NOT SET/g' \
| column -t
Pipeline 3: Process monitoring data with awk¶
# From Prometheus text format: find metrics above threshold
curl -s http://localhost:9090/metrics \
| awk '/^http_requests_total/ && !/^#/ {
split($0, parts, " ")
value = parts[2]
if (value+0 > 10000) print $0
}'
Pipeline 4: Cross-referencing two data sources¶
The spike IPs from the access log — are they hitting pods that are unhealthy?
# Step 1: Get the suspicious IPs
SPIKE_IPS=$(awk '{ count[$1]++ } END {
for (ip in count) if (count[ip] > 100000) print ip
}' access.log)
# Step 2: Check which pods those IPs are hitting (from JSON logs)
for ip in $SPIKE_IPS; do
echo "=== $ip ==="
cat api-gateway.log \
| jq -r --arg ip "$ip" 'select(.client_ip == $ip) | .path' \
| sort | uniq -c | sort -rn | head -5
done
# Step 3: Check pod health for those endpoints
kubectl get pods -o json | jq -r '
.items[]
| select(.status.containerStatuses[]?.restartCount > 0)
| [.metadata.name, .status.phase,
(.status.containerStatuses[0].restartCount | tostring)]
| @tsv
' | column -t
Part 7: Performance — When awk Beats Python (and When It Doesn't)¶
The benchmark everyone should see¶
Processing a 10 GB access log to count requests per IP:
| Approach | Time | Memory |
|---|---|---|
| `awk '{c[$1]++} END {for(i in c) print c[i],i}'` | ~3 min | ~800 KB |
| `sort \| uniq -c` | ~8 min | ~200 KB (disk-backed sort) |
| Python with `readlines()` | ~45 min | ~12 GB |
| Python with line-by-line read | ~15 min | ~50 MB |
War Story: A junior engineer wrote a Python script to parse a 50 GB access log and count requests per IP. It took 45 minutes and consumed 12 GB of RAM before finishing. The equivalent awk one-liner ran in 3 minutes using 800 KB. The lesson: awk (and mawk) are purpose-built for this shape of work. Save Python for when you need libraries, complex data structures, or error handling beyond what awk provides.
When Python wins¶
- Complex JSON transformations with error handling and retries
- Data that needs to be joined across multiple files with different schemas
- Anything involving HTTP requests, databases, or third-party APIs
- When the "script" grows beyond ~10 lines of awk
mawk vs gawk: it matters at scale¶
| Implementation | Default on | Speed | Features |
|---|---|---|---|
| gawk (GNU awk) | Red Hat, Fedora, Arch | Baseline | Full: networking, time functions, match() captures |
| mawk (Mike's awk) | Debian, Ubuntu | 2–10x faster | Minimal: no match() captures, no networking |
Trivia: mawk's speed advantage comes from using a bytecode interpreter instead of gawk's tree-walking interpreter. For small files, the difference is negligible. For multi-gigabyte log processing, mawk can be 10x faster. Michael Brennan wrote mawk in the early 1990s, and it remains the default awk on Debian-derived distributions.
Part 8: Processing CSV/TSV Data¶
Real operational data comes in CSV and TSV constantly — cloud billing exports, inventory lists, monitoring data exports.
awk for CSV (with caveats)¶
# Sum the third column of a CSV (skip header)
awk -F, 'NR > 1 { sum += $3 } END { printf "Total: $%.2f\n", sum }' invoices.csv
# Filter rows where status column contains "failed"
awk -F, '$4 ~ /failed/' jobs.csv
# Reformat: swap columns, change delimiter
awk -F, 'BEGIN { OFS="\t" } NR > 1 { print $3, $1, $2 }' data.csv > data.tsv
Gotcha: awk's CSV parsing is naive — it splits on every comma, including commas inside quoted fields. `"Smith, John",42,admin` breaks into four fields instead of three. For proper CSV with quoted fields, use `gawk` with `FPAT` or switch to Python's `csv` module. For quick-and-dirty ops work where you control the data, awk is fine.
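You can watch the naive split go wrong in one line:

```shell
# The row has 3 logical columns, but the comma inside the quotes also splits
echo '"Smith, John",42,admin' | awk -F, '{ print NF }'
# 4
```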
jq for CSV/TSV output¶
# Convert JSON array to CSV
echo '[{"name":"web","cpu":"500m"},{"name":"api","cpu":"1000m"}]' \
| jq -r '.[] | [.name, .cpu] | @csv'
# "web","500m"
# "api","1000m"
# TSV (no quoting, tab-separated)
echo '[{"name":"web","cpu":"500m"}]' \
| jq -r '.[] | [.name, .cpu] | @tsv'
# web 500m
sed for CSV cleanup¶
# Remove Windows carriage returns from a CSV
sed 's/\r$//' windows_export.csv > clean.csv
# Strip surrounding quotes from every field
sed 's/"//g' quoted.csv
# Replace delimiter: semicolons to commas (European CSV exports)
sed 's/;/,/g' european.csv > standard.csv
Part 9: The Decision Tree¶
You're staring at data. Which tool do you reach for?
Is the data JSON?
├── Yes → jq
│ ├── Simple field extraction → jq '.field'
│ ├── Filtering → jq 'select()'
│ ├── Aggregation → jq -s 'group_by() | map()'
│ └── Need to feed into other tools → jq -r '... | @tsv' | awk/sort/...
│
├── No → Is it columnar (fixed fields per line)?
│ ├── Yes → awk
│ │ ├── Extract fields → awk '{print $2, $5}'
│ │ ├── Filter + extract → awk '/pattern/ {print $1}'
│ │ ├── Count/sum/average → awk '{sum+=$3} END {print sum}'
│ │ └── Group by key → awk '{c[$1]++} END {for(k in c) print k,c[k]}'
│ │
│ └── No → Is it a quick text transformation?
│ ├── Yes → sed
│ │ ├── Find-and-replace → sed 's/old/new/g'
│ │ ├── Delete lines → sed '/pattern/d'
│ │ ├── Extract range → sed -n '/start/,/end/p'
│ │ └── In-place config edit → sed -i 's/old/new/' file
│ │
│ └── Complex/multi-format → Write a Python script
Interview Bridge: "When would you use sed vs awk vs jq?" is a common DevOps interview question. The clean answer: jq for JSON, awk for columnar data and aggregation, sed for simple substitutions and line surgery. When the task grows beyond ~10 lines, switch to Python. Showing you know when to use each tool matters more than memorizing syntax.
Exercises¶
Exercise 1: Quick Win (2 minutes)¶
Given this JSON, extract just the names of pods that are not "Running":
{"items":[{"metadata":{"name":"web-abc"},"status":{"phase":"Running"}},{"metadata":{"name":"api-def"},"status":{"phase":"CrashLoopBackOff"}},{"metadata":{"name":"db-ghi"},"status":{"phase":"Running"}}]}
Solution
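One way to do it, run against the exercise's JSON inline (`-r` keeps the names shell-clean):

```shell
echo '{"items":[{"metadata":{"name":"web-abc"},"status":{"phase":"Running"}},{"metadata":{"name":"api-def"},"status":{"phase":"CrashLoopBackOff"}},{"metadata":{"name":"db-ghi"},"status":{"phase":"Running"}}]}' \
| jq -r '.items[] | select(.status.phase != "Running") | .metadata.name'
# api-def
```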
Exercise 2: awk Aggregation (5 minutes)¶
You have an access log line format: IP - - [timestamp] "METHOD /path HTTP/1.1" STATUS BYTES
Write an awk one-liner that prints the total bytes served per unique IP address, sorted by bytes descending. Field 10 is the byte count.
Hint: Use an associative array keyed on `$1` (the IP), accumulating `$10`.

Solution
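A possible solution, demonstrated here on three fabricated log lines (with the real file you would run the awk one-liner against `access.log`):

```shell
printf '%s\n' \
  '10.0.5.200 - - [22/Mar/2026:15:32:01 +0000] "GET /api/v2/products HTTP/1.1" 200 1000' \
  '10.0.5.200 - - [22/Mar/2026:15:32:02 +0000] "GET /api/v2/products HTTP/1.1" 200 500' \
  '10.0.1.47 - - [22/Mar/2026:15:32:03 +0000] "GET /health HTTP/1.1" 200 100' \
| awk '{ bytes[$1] += $10 } END { for (ip in bytes) print bytes[ip], ip }' \
| sort -rn
# 1500 10.0.5.200
# 100 10.0.1.47
```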
Exercise 3: Pipeline Composition (10 minutes)¶
Your JSON API logs contain {"timestamp":"...","status":200,"path":"/api/v2/products","duration_ms":42}.
Build a pipeline that:
1. Extracts only requests to paths starting with /api/v2/
2. Groups by path
3. Shows average duration and request count per path
4. Outputs as a formatted table
Hint: Use jq to filter and extract TSV, then awk for aggregation, then `column -t` for formatting.

Solution
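One composition that satisfies all four requirements, shown against fabricated lines (with the real file, start from `cat api-gateway.log` instead of `printf`):

```shell
printf '%s\n' \
  '{"timestamp":"t1","status":200,"path":"/api/v2/products","duration_ms":40}' \
  '{"timestamp":"t2","status":200,"path":"/api/v2/products","duration_ms":60}' \
  '{"timestamp":"t3","status":200,"path":"/healthz","duration_ms":2}' \
| jq -r 'select(.path | startswith("/api/v2/")) | [.path, .duration_ms] | @tsv' \
| awk -F'\t' '{ sum[$1] += $2; n[$1]++ }
      END { for (p in sum) printf "%s\t%.1f\t%d\n", p, sum[p]/n[p], n[p] }' \
| column -t
# one row per /api/v2/ path: average duration 50.0, request count 2
```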
Exercise 4: sed Surgery (5 minutes)¶
You have a config file with lines like # ENABLE_FEATURE_X=true (commented out).
Write a sed command that uncomments only the line containing FEATURE_X without
affecting any other commented lines.
Solution
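One answer: anchor the substitution to lines matching FEATURE_X, then strip the comment prefix. Demonstrated on two fabricated config lines:

```shell
# Only the FEATURE_X line loses its "# " prefix; the other comment survives
printf '%s\n' '# ENABLE_FEATURE_X=true' '# SOME_OTHER_FLAG=false' \
| sed '/FEATURE_X/s/^# *//'
# ENABLE_FEATURE_X=true
# # SOME_OTHER_FLAG=false
```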
Cheat Sheet¶
jq¶
| Pattern | Command |
|---|---|
| Pretty-print | jq . |
| Extract field | jq '.metadata.name' |
| Raw output (no quotes) | jq -r '.name' |
| Filter array | jq '.[] \| select(.status != "Running")' |
| Transform all elements | jq 'map(.name)' |
| Construct new object | jq '{name: .metadata.name, phase: .status.phase}' |
| Default for missing field | jq '.timeout // 30' |
| Count items | jq -s 'length' |
| Group and count | jq -s 'group_by(.key) \| map({key: .[0].key, n: length})' |
| CSV output | jq -r '[.f1, .f2] \| @csv' |
| TSV output | jq -r '[.f1, .f2] \| @tsv' |
| Pass shell variable | jq --arg v "$VAR" 'select(.name == $v)' |
awk¶
| Pattern | Command |
|---|---|
| Print a field | awk '{print $2}' |
| Set delimiter | awk -F: '{print $1}' |
| Filter by pattern | awk '/ERROR/ {print}' |
| Filter by field value | awk '$3 > 100' |
| Count by key | awk '{c[$1]++} END {for(k in c) print k,c[k]}' |
| Sum a column | awk '{s+=$3} END {print s}' |
| Average | awk '{s+=$3;n++} END {print s/n}' |
| Skip header | awk 'NR > 1' |
| Deduplicate | awk '!seen[$0]++' |
| Printf formatting | awk '{printf "%-20s %5d\n", $1, $2}' |
sed¶
| Pattern | Command |
|---|---|
| Substitute (first per line) | sed 's/old/new/' |
| Substitute (all per line) | sed 's/old/new/g' |
| In-place edit (portable) | sed -i.bak 's/old/new/g' file && rm file.bak |
| Delete matching lines | sed '/pattern/d' |
| Print only matches | sed -n '/pattern/p' |
| Extract line range | sed -n '10,20p' |
| Extract pattern range | sed -n '/START/,/END/p' |
| Use different delimiter | sed 's\|/old/path\|/new/path\|g' |
| Extended regex | sed -E 's/(group)/\1/' |
| Delete blank lines | sed '/^$/d' |
Takeaways¶
- Match tool to data shape. jq for JSON, awk for columns, sed for text surgery. Forcing the wrong tool wastes time and produces fragile commands.
- awk's mental model in one sentence: "For each line, if pattern matches, do action." Associative arrays make it a one-pass aggregation engine.
- jq is SQL for JSON. `.items[]` = FROM, `select()` = WHERE, `{name, status}` = SELECT. Learn these three and you handle 90% of kubectl/API work.
- sed is a stream surgeon, not a programmer. Substitution and line-level operations are its strength. The hold space exists, but if you need it, you probably need awk instead.
- Pipelines beat monoliths. jq extracts → awk aggregates → sort orders → head limits. Each tool handles one stage. This composability is the Unix philosophy in action.
- awk processes gigabyte files in constant memory. Single-pass, line-at-a-time processing means awk handles files that would crash a naive Python script. Know when to reach for it.
Related Lessons¶
- strace: Reading the Matrix — when the problem is below your application and you need to see the syscall conversation
- What Happens When You Type a Regex — understand NFA vs DFA backtracking and why some regex patterns are dangerous
- What Happens Inside a Linux Pipe — the kernel mechanics behind every `|` in your pipelines
- Why Everything Uses JSON Now — the history and tradeoffs of JSON as the universal data format