awk — Street-Level Ops¶
Real-world awk workflows for log parsing, data extraction, metrics calculation, and CLI tool output processing in production.
Log Parsing One-Liners¶
# Extract timestamp fields and the last field of ERROR lines from syslog
awk '/ERROR/ { print $1, $2, $3, $NF }' /var/log/syslog
# Count errors per hour from timestamped logs (format: 2024-03-15T14:23:01)
awk -F'T' '/ERROR/ { split($2, t, ":"); hours[$1"T"t[1]]++ }
END { for (h in hours) print h, hours[h] }' app.log | sort
# Output:
# 2024-03-15T14 23
# 2024-03-15T15 7
# 2024-03-15T16 45
# Extract unique IP addresses from access logs
awk '{ print $1 }' /var/log/nginx/access.log | sort -u
# Top 10 IPs by request count
awk '{ ips[$1]++ } END { for (ip in ips) print ips[ip], ip }' \
/var/log/nginx/access.log | sort -rn | head -10
# Output:
# 4521 10.0.1.50
# 3892 10.0.1.51
# 1247 192.168.1.100
# Average response time from nginx logs (assuming response time in seconds is the last field)
awk '{ sum += $NF; count++ } END { printf "Avg: %.3f s\n", sum/count }' access.log
# Request count by HTTP status code
awk '{ status[$9]++ } END { for (s in status) print s, status[s] }' \
/var/log/nginx/access.log | sort
# Output:
# 200 45231
# 301 892
# 404 127
# 500 23
# Find slow requests (response time > 2 seconds, time in last field)
awk '$NF > 2.0 { print $4, $7, $NF"s" }' access.log
# Extract failed SSH login attempts
awk '/Failed password/ { print $(NF-3) }' /var/log/auth.log | sort | uniq -c | sort -rn
Extracting Fields from CSV/TSV¶
# Print column 2 from a CSV (naive — does not handle quoted commas)
awk -F, '{ print $2 }' data.csv
# Print columns 1 and 3 as TSV
awk -F, 'BEGIN { OFS="\t" } { print $1, $3 }' data.csv
# Skip header line
awk -F, 'NR > 1 { print $2 }' data.csv
# Filter rows where column 3 exceeds a threshold
awk -F, 'NR==1 || $3 > 1000' sales.csv
# Sum a numeric column (column 4)
awk -F, 'NR > 1 { sum += $4 } END { printf "Total: $%.2f\n", sum }' invoices.csv
# Group by a column and sum another
awk -F, 'NR > 1 { totals[$2] += $4 }
END { for (dept in totals) printf "%-20s $%.2f\n", dept, totals[dept] }' expenses.csv
# Output:
# Engineering $45230.00
# Marketing $23100.50
# Operations $18900.00
Parsing /etc/passwd¶
# List all users and their shells
awk -F: '{ printf "%-20s %s\n", $1, $7 }' /etc/passwd
# Find users with bash shell
awk -F: '$7 ~ /bash/ { print $1 }' /etc/passwd
# Find users with UID >= 1000 (real users)
awk -F: '$3 >= 1000 { print $1, $3, $7 }' /etc/passwd
# Find users with no shell (service accounts)
awk -F: '$7 == "/usr/sbin/nologin" || $7 == "/bin/false" { print $1 }' /etc/passwd
# Count users per shell
awk -F: '{ shells[$7]++ } END { for (s in shells) printf "%-30s %d\n", s, shells[s] }' /etc/passwd
# Output:
# /bin/bash 5
# /usr/sbin/nologin 23
# /bin/false 3
Summing Columns from Command Output¶
# Total disk usage from df (used KB, column 3)
df -k | awk 'NR > 1 { sum += $3 } END { printf "Total used: %.1f GB\n", sum/1048576 }'
# Total memory used by a process group
ps aux | awk '/nginx/ { sum += $6 } END { printf "nginx RSS: %.1f MB\n", sum/1024 }'
# Total container memory from docker stats
docker stats --no-stream --format "{{.MemUsage}}" | \
awk -F/ '{ gsub(/[^0-9.]/, "", $1); sum += $1 } END { printf "Total: %.1f MiB\n", sum }'
# Count pods by status
kubectl get pods --no-headers | awk '{ status[$3]++ } END { for (s in status) print s, status[s] }'
# Output:
# Running 12
# CrashLoopBackOff 2
# Completed 5
Processing CLI Tool Output¶
# kubectl: extract restart counts for crashing pods
kubectl get pods --no-headers | awk '$4 > 3 { print $1, "restarts:", $4 }'
# docker: list images sorted by size
docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | \
awk -F'\t' '{ print $2, $1 }' | sort -h
# aws: extract instance IDs and private IPs from text output
aws ec2 describe-instances --output text | \
awk '/^INSTANCES/ { id=$8; ip=$14 } /^TAGS.*Name/ { print $3, id, ip }'
# terraform output parsing
terraform output -json | jq -r 'to_entries[] | [.key, (.value.value | tostring)] | @tsv' | \
awk -F'\t' '{ printf "%-30s %s\n", $1, $2 }'
# Parse helm release info
helm list -A --output json | jq -r '.[] | [.name, .namespace, .status] | @tsv' | \
awk -F'\t' '{ printf "%-25s %-15s %s\n", $1, $2, $3 }'
# Parse ss/netstat output for listening ports
ss -tlnp | awk 'NR > 1 { split($4, a, ":"); print a[length(a)], $NF }' | sort -n
# Combine awk with sort for top-N analysis
journalctl --since "1 hour ago" --no-pager | \
awk 'tolower($0) ~ /error|fail/ { units[$5]++ }
END { for (u in units) print units[u], u }' | sort -rn | head -10
Quick Metrics with awk¶
# Count requests per second from access-log timestamps ($4, e.g. [15/Mar/2024:14:23:01)
awk '{ secs[substr($4, 2)]++ }
END { for (s in secs) print s, secs[s] }' access.log | sort | tail -5
# Percentile calculation (p50, p95, p99)
# First sort the data, then use awk
sort -n response_times.txt | awk '{
a[NR] = $1
}
END {
printf "p50: %.3f\n", a[int(NR*0.50)]
printf "p95: %.3f\n", a[int(NR*0.95)]
printf "p99: %.3f\n", a[int(NR*0.99)]
}'
# Histogram of response times (bucket by 100ms)
awk '{ bucket = int($1 * 10) / 10; hist[bucket]++ }
END { for (b in hist) { bar = ""; for (i = 0; i < hist[b]; i++) bar = bar "#"
      printf "%6.1f |%s %d\n", b, bar, hist[b] | "sort -n" } }' \
response_times.txt
# Bytes transferred per endpoint
awk '{ bytes[$7] += $10 }
END { for (ep in bytes) printf "%10d %s\n", bytes[ep], ep }' access.log | sort -rn | head -10
# Connection rate by minute (strip the leading "[" from $4, keep date through minutes)
awk '{ min = substr($4, 2, 17); counts[min]++ }
END { for (m in counts) print m, counts[m] }' access.log | sort | tail -10
Data Transformation and Reporting¶
# Pivot table: count by two dimensions
awk '{ key = $1 SUBSEP $2; count[key]++ }
END { for (k in count) {
split(k, parts, SUBSEP)
printf "%-15s %-15s %d\n", parts[1], parts[2], count[k]
}}' events.log
# Generate a markdown table from TSV
awk -F'\t' 'NR==1 { for (i=1;i<=NF;i++) printf "| %s ", $i; print "|"
for (i=1;i<=NF;i++) printf "|---"; print "|" }
NR>1 { for (i=1;i<=NF;i++) printf "| %s ", $i; print "|" }' data.tsv
# Extract the value of a "status" field from JSON lines (naive — quoted string values only)
awk -F'"' '/"status"/ { for (i=1;i<=NF;i++) if ($i=="status") print $(i+2) }' events.jsonl | sort | uniq -c
# Dedup by key, keeping last occurrence
awk -F'\t' '{ data[$1] = $0 } END { for (k in data) print data[k] }' events.tsv
Power One-Liners¶
Curated awk one-liners with detailed breakdowns. These teach core awk patterns you will use constantly.
Deduplicate lines without sorting¶
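A runnable sketch of the idiom the breakdown describes (sample data is piped in for demonstration; in practice you pass a file name):

```shell
# Print each distinct line only the first time it appears; input order preserved
printf 'alpha\nbeta\nalpha\ngamma\nbeta\n' | awk '!seen[$0]++'
# → alpha, beta, gamma (one per line)
```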
Breakdown: seen is an associative array keyed by the full line ($0). On first encounter the value is 0 (falsy after !), so the line prints. The ++ increments it, so subsequent identical lines are truthy after ! and suppressed. This preserves original order unlike sort -u.
[!TIP] When to use: Deduplicating log entries, config lists, or CSV exports where order matters.
Memory usage per user (aggregate from ps)¶
ps aux | awk '{mem[$1]+=$4} END {for (u in mem) printf "%-12s %.1f%%\n", u, mem[u]}' | sort -k2 -rn
Breakdown: $1 is username, $4 is %MEM. Accumulates memory per user into associative array mem. END block iterates all users, printf formats with padding. Piped to sort for descending order.
[!TIP] When to use: Quick triage during memory pressure — who's the hog?
Graph connections per host (ASCII histogram)¶
netstat -an | grep ESTABLISHED | awk '{print $5}' | awk -F: '{print $1}' | sort | uniq -c | awk '{printf "%-20s %5d ", $2, $1; for(i=0;i<$1;i++) printf "*"; print ""}'
Breakdown: Three-stage awk pipeline: (1) extract remote address field, (2) strip port with : as field separator, (3) format into aligned histogram where each * = one connection. The printf with %-20s left-aligns IPs.
[!TIP] When to use: Investigating connection storms, DDoS patterns, or connection pooling issues.
Display text between two patterns¶
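A minimal sketch of the range pattern described here (START/END are placeholder regexes; sample input is inlined):

```shell
# Print from the first line matching START through the next line matching END (inclusive)
printf 'skip\nSTART\nbody1\nbody2\nEND\nskip\n' | awk '/START/,/END/'
```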
Breakdown: AWK range pattern — when a line matches the first regex, awk starts printing. It continues until a line matches the second regex (inclusive). No variables, no state management needed.
[!TIP] When to use: Extracting config blocks, log windows between timestamps, or function bodies from source files.
Sum a column of numbers¶
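The accumulator pattern in its smallest form (sample numbers piped in for demonstration):

```shell
# Sum the first field of every line
printf '10\n20\n12\n' | awk '{ sum += $1 } END { print sum }'
# → 42
```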
Breakdown: Accumulator pattern — sum starts at 0 (awk default for uninitialized). Each line adds field 1. END fires once after all input is consumed. Change $1 to $N for any column.
[!TIP] When to use: Summing byte counts, request durations, error counts from log extracts.
Exclude specific columns¶
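A sketch of the field-nullifying idiom the breakdown describes (columns 2 and 4 are the example targets; sample data is made up):

```shell
# Blank out fields 2 and 4, then print the rebuilt record
printf 'user secret host token region\n' | awk '{ $2 = ""; $4 = ""; print }'
# → user  host  region   (note the doubled spaces left behind)
```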
Breakdown: Nullifies fields 2 and 4, then prints the whole record. OFS (output field separator) collapses them. Note: this changes spacing — pipe through column -t to re-align.
[!TIP] When to use: Stripping sensitive columns (passwords, tokens) from tabular output before sharing.
Analyze failed SSH logins from auth logs¶
sudo zcat /var/log/auth.log.*.gz | awk '/Failed password/ && !/invalid user/ {a[$9]++} /Failed password for invalid user/ {a["*"$11]++} END {for (i in a) printf "%6d\t%s\n", a[i], i}' | sort -rn | head -20
Breakdown: Two pattern-action pairs: valid usernames go into a[$9], invalid users (prefixed with *) into a[$11] — field positions differ because "invalid user" adds extra words. END block formats counts. sort -rn puts worst offenders first.
[!TIP] When to use: Post-incident forensics, brute-force detection, fail2ban tuning.
Apache/Nginx top talkers from access log¶
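The frequency-counter one-liner this breakdown refers to, sketched with inline sample lines (the IPs are made up; real usage reads the access log):

```shell
# Count hits per client IP ($1 in Common Log Format), highest first
printf '10.0.0.1 a\n10.0.0.2 b\n10.0.0.1 c\n' | \
  awk '{ ip[$1]++ } END { for (i in ip) print ip[i], i }' | sort -rn | head -10
```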
Breakdown: Classic frequency counter — $1 is the client IP in Common Log Format. Associative array ip counts hits per IP. Same pattern works for any field: $7 for URLs, $9 for status codes.
[!TIP] When to use: Identifying abusive IPs, traffic spikes, or bot activity.
Generate ASCII histogram from log timestamps¶
awk '{print substr($0,1,15)}' /var/log/syslog | uniq -c | sort -rn | head -20 | awk '{printf "%5d %s %s %s ", $1, $2, $3, $4; for(i=0;i<$1/10;i++) printf "#"; print ""}'
Breakdown: First awk extracts the timestamp prefix (first 15 chars covers Mon DD HH:MM:SS). uniq -c counts consecutive identical timestamps. Second awk draws bars scaled by /10 to keep width manageable.
[!TIP] When to use: Visualizing log bursts, correlating error spikes with events.
Find long lines in files (style/config auditing)¶
awk 'length > 120 {printf "%s:%d (%d chars): %s\n", FILENAME, NR, length, substr($0,1,80)"..."}' *.conf
Breakdown: length with no argument returns length of $0. FILENAME and NR are built-in — file name and line number. substr truncates the preview. Processes multiple files in one pass.
[!TIP] When to use: Enforcing line-length limits in config files, finding runaway log lines.
User/group relationship diagram (graphviz)¶
awk -F: 'BEGIN {print "digraph groups {"} {split($4,members,","); for(m in members) printf " \"%s\" -> \"%s\"\n", members[m], $1} END {print "}"}' /etc/group | dot -Tpng -o groups.png
Breakdown: BEGIN opens the digraph. Main body splits the member list (field 4, comma-delimited) and creates edges from each member to the group name (field 1). END closes the graph. Piped to graphviz dot.
[!TIP] When to use: Auditing group memberships, visualizing RBAC structures, onboarding documentation.
Transpose rows to columns¶
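A sketch of the accumulating-array transpose the breakdown describes; it assumes rectangular input (every row has the same NF), since the END loop uses the last row's NF:

```shell
# Accumulate the i-th field of every row into a[i], then print each as a row
printf '1 2 3\n4 5 6\n' | \
  awk '{ for (i = 1; i <= NF; i++) a[i] = (NR == 1 ? $i : a[i] OFS $i) }
       END { for (i = 1; i <= NF; i++) print a[i] }'
```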
Breakdown: Builds array a where each index i accumulates the i-th field from every row. After all input, prints each accumulated row (which is now a column). NF = number of fields.
[!TIP] When to use: Reformatting monitoring output, pivoting CSV data, preparing data for plotting.
Print line BEFORE a regex match¶
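A minimal sketch of the lookbehind idiom (ERROR is a placeholder regex; sample lines inlined):

```shell
# Keep the previous line in prev; print it when the current line matches
printf 'one\ntwo\nERROR boom\nfour\n' | awk '/ERROR/ { print prev } { prev = $0 }'
# → two
```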
Breakdown: Every line, prev stores the current line. When the next line matches, we print prev — which is the line before the match. This is the awk equivalent of grep -B1 but without needing GNU grep.
[!TIP] When to use: Finding what happened right before an error in a log, getting context before a pattern.
Print line AFTER a regex match¶
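A minimal sketch of the getline idiom (the label regex and sample lines are illustrative):

```shell
# On a match, consume the following line with getline and print it
printf 'Label:\nvalue42\nother\n' | awk '/Label:/ { getline; print }'
# → value42
```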
Breakdown: getline reads the next input line into $0, advancing NR. So when the regex matches, we immediately consume the next line and print it. Subtle: if the next line also matches, its successor won't be printed (getline already consumed it).
[!TIP] When to use: Extracting values that appear on the line after a label/header in structured output.
Reverse field order on each line¶
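The countdown loop the breakdown describes, sketched on a sample line:

```shell
# Walk fields from NF down to 1; the final print "" supplies the newline
printf 'a b c\n' | awk '{ for (i = NF; i >= 1; i--) printf "%s ", $i; print "" }'
```

Note the output keeps a trailing space after the last field; pipe through `sed 's/ $//'` if that matters downstream.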
Breakdown: Loops from the last field (NF) down to 1, printing each with a space. The final print "" adds the newline. This reverses column order without touching line order.
[!TIP] When to use: Swapping column order in logs, reversing CSV fields, reformatting data for different consumers.
Join every N lines into one (comma-separated)¶
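The one-assignment program the breakdown dissects, sketched for N=5 with sample input:

```shell
# ORS is "," except on every 5th line, where it becomes a newline
printf '1\n2\n3\n4\n5\n6\n7\n' | awk 'ORS = NR % 5 ? "," : "\n"'
```

If the input length isn't a multiple of 5, the final group ends with a trailing comma and no newline, as the last two lines show here.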
Breakdown: ORS (Output Record Separator) is dynamically set each line. If NR%5 is nonzero, ORS becomes , (joining to previous). Every 5th line, ORS becomes \n (ending the group). This is awk at its most terse — the entire program is one assignment that doubles as the condition-action.
[!TIP] When to use: Collapsing multi-line records into CSV rows, reformatting list output for spreadsheets, batching items.
Find the maximum value and its line¶
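A sketch of the running-max pattern; the NR==1 clause seeds max from the first line so all-negative data works too:

```shell
# Track the running max of field 1 and the full line that produced it
printf '3 foo\n9 bar\n5 baz\n' | \
  awk 'NR == 1 || $1 > max { max = $1; line = $0 } END { print line }'
# → 9 bar
```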
Breakdown: Compares field 1 of each line against running max. When a new max is found, stores both the value and the full line. END prints the winner. Change $1 to any field. Change > to < for minimum.
[!TIP] When to use: Finding the slowest request in a log, the largest file in du output, the highest-latency host.
Count all words in a file (wc -w alternative)¶
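The NF accumulator in its smallest form, with sample input inlined:

```shell
# NF is the per-line word count; accumulate it across all lines
printf 'one two\nthree\n' | awk '{ total += NF } END { print total }'
# → 3
```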
Breakdown: NF (Number of Fields) is the word count per line. Accumulate across all lines, print at end. Simpler and more composable than wc -w in pipelines.
[!TIP] When to use: Quick word counts, validating data density, checking if a file has content.
Count pattern occurrences¶
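A sketch of the counter with the n+0 guard the breakdown explains (ERROR is a placeholder pattern):

```shell
# n+0 prints 0 instead of an empty string when nothing matched
printf 'ok\nERROR a\nok\nERROR b\n' | awk '/ERROR/ { n++ } END { print n + 0 }'
# → 2
```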
Breakdown: Increments counter on each match. n+0 in the END block ensures 0 is printed (not blank) when there are no matches — because uninitialized awk variables are empty strings, and +0 forces numeric context.
[!TIP] When to use: Counting errors, warnings, or specific events in logs. The +0 trick is a classic awk idiom worth internalizing.
Squeeze whitespace (normalize fields)¶
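The idiom itself, sketched on a deliberately messy sample line:

```shell
# $1=$1 forces awk to rebuild $0 with OFS; the bare 1 triggers the default print
printf '  a\t b   c \n' | awk '{ $1 = $1 } 1'
# → a b c
```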
Breakdown: Assigning $1=$1 forces awk to reconstruct $0 using OFS (default: single space). This collapses all runs of whitespace (tabs, multiple spaces) into single spaces. The 1 is a true condition that triggers the default action (print). This is the most cryptic-looking yet useful awk idiom.
[!TIP] When to use: Normalizing messy output from ps, df, mount, or any command with variable-width columns before further processing.
Reverse all lines in a file (tac alternative)¶
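A sketch of the buffer-and-replay pattern; note it holds the whole file in memory, so it suits logs, not multi-gigabyte files:

```shell
# Buffer every line, then replay from NR down to 1
printf 'a\nb\nc\n' | awk '{ a[NR] = $0 } END { for (i = NR; i >= 1; i--) print a[i] }'
```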
Breakdown: Stores every line in array a indexed by line number. END block walks backward from NR to 1. This is tac in pure awk — works on systems where tac isn't installed (macOS, some BSDs).
[!TIP] When to use: Reading logs bottom-up (most recent first), reversing sort order, processing files in reverse.
Print lines between line numbers (inclusive)¶
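The line-number range pattern from the breakdown — `awk 'NR==8,NR==12' file` — demonstrated here on a smaller sample (lines 2 through 4):

```shell
# Range pattern on NR: activates at line 2, deactivates after line 4
printf '1\n2\n3\n4\n5\n' | awk 'NR == 2, NR == 4'
```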
Breakdown: Range pattern using line numbers instead of regexes. Activates at line 8, deactivates after line 12. Cleaner than sed -n '8,12p' because awk's range syntax is more readable and composable.
[!TIP] When to use: Extracting a known section from a file by line number, debugging specific log ranges.
Multi-pattern substitution (normalize variants)¶
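A sketch using the log-level normalization from the tip below; listing longer variants first in the alternation avoids a partial "warn" match eating "warning" on implementations that aren't strictly leftmost-longest:

```shell
# One gsub with alternation normalizes all variants in a single pass
printf 'warn: x\nWARN: y\nwarning: z\n' | \
  awk '{ gsub(/warning|warn|WARN/, "WARNING") } 1'
```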
Breakdown: gsub with alternation (|) replaces ALL variants in one pass. The 1 triggers print. Unlike sed which needs separate s/// commands for each pattern, awk handles alternation natively in one gsub.
[!TIP] When to use: Normalizing log levels (warn/warning/WARN to WARNING), standardizing status codes, cleaning up inconsistent data.
Join backslash-continued lines¶
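A sketch of the sub/getline/next combination the breakdown walks through (sample input inlined; note `\\` in printf yields one literal backslash):

```shell
# Strip the trailing backslash, pull in the next line with getline, print them joined
printf 'foo \\\nbar\nbaz\n' | \
  awk '/\\$/ { sub(/\\$/, ""); getline t; print $0 t; next } 1'
```

This handles one continuation per line; a continued line whose successor also ends in a backslash needs a while loop around the getline.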
Breakdown: If a line ends with \, strip the backslash, read the next line into variable t via getline, concatenate and print. next skips to the next input line. The trailing 1 prints non-continued lines normally.
[!TIP] When to use: Processing Makefiles, shell scripts, or config files that use \ continuation. Useful for feeding to tools that don't understand continuation.