awk — Street-Level Ops¶
Real-world awk workflows for log parsing, data extraction, metrics calculation, and CLI tool output processing in production.
Log Parsing One-Liners¶
# Extract timestamp fields and the last field of ERROR lines from syslog
awk '/ERROR/ { print $1, $2, $3, $NF }' /var/log/syslog
# Count errors per hour from timestamped logs (format: 2024-03-15T14:23:01)
awk -F'T' '/ERROR/ { split($2, t, ":"); hours[$1"T"t[1]]++ }
END { for (h in hours) print h, hours[h] }' app.log | sort
# Output:
# 2024-03-15T14 23
# 2024-03-15T15 7
# 2024-03-15T16 45
# Extract unique IP addresses from access logs
awk '{ print $1 }' /var/log/nginx/access.log | sort -u
# Top 10 IPs by request count
awk '{ ips[$1]++ } END { for (ip in ips) print ips[ip], ip }' \
/var/log/nginx/access.log | sort -rn | head -10
# Output:
# 4521 10.0.1.50
# 3892 10.0.1.51
# 1247 192.168.1.100
# Average response time from nginx logs (assuming response time in seconds is the last field)
awk '{ sum += $NF; count++ } END { printf "Avg: %.3f s\n", sum/count }' access.log
# Request count by HTTP status code
awk '{ status[$9]++ } END { for (s in status) print s, status[s] }' \
/var/log/nginx/access.log | sort
# Output:
# 200 45231
# 301 892
# 404 127
# 500 23
# Find slow requests (response time > 2 seconds, time in last field)
awk '$NF > 2.0 { print $4, $7, $NF"s" }' access.log
# Extract failed SSH login attempts
awk '/Failed password/ { print $(NF-3) }' /var/log/auth.log | sort | uniq -c | sort -rn
Extracting Fields from CSV/TSV¶
# Print column 2 from a CSV (naive — does not handle quoted commas)
awk -F, '{ print $2 }' data.csv
# Print columns 1 and 3 as TSV
awk -F, 'BEGIN { OFS="\t" } { print $1, $3 }' data.csv
# Skip header line
awk -F, 'NR > 1 { print $2 }' data.csv
# Filter rows where column 3 exceeds a threshold
awk -F, 'NR==1 || $3 > 1000' sales.csv
# Sum a numeric column (column 4)
awk -F, 'NR > 1 { sum += $4 } END { printf "Total: $%.2f\n", sum }' invoices.csv
# Group by a column and sum another
awk -F, 'NR > 1 { totals[$2] += $4 }
END { for (dept in totals) printf "%-20s $%.2f\n", dept, totals[dept] }' expenses.csv
# Output:
# Engineering $45230.00
# Marketing $23100.50
# Operations $18900.00
Parsing /etc/passwd¶
# List all users and their shells
awk -F: '{ printf "%-20s %s\n", $1, $7 }' /etc/passwd
# Find users with bash shell
awk -F: '$7 ~ /bash/ { print $1 }' /etc/passwd
# Find users with UID >= 1000 (real users)
awk -F: '$3 >= 1000 { print $1, $3, $7 }' /etc/passwd
# Find users with no shell (service accounts)
awk -F: '$7 == "/usr/sbin/nologin" || $7 == "/bin/false" { print $1 }' /etc/passwd
# Count users per shell
awk -F: '{ shells[$7]++ } END { for (s in shells) printf "%-30s %d\n", s, shells[s] }' /etc/passwd
# Output:
# /bin/bash 5
# /usr/sbin/nologin 23
# /bin/false 3
Summing Columns from Command Output¶
# Total disk usage from df (used KB, column 3)
df -k | awk 'NR > 1 { sum += $3 } END { printf "Total used: %.1f GB\n", sum/1048576 }'
# Total memory used by a process group
ps aux | awk '/nginx/ { sum += $6 } END { printf "nginx RSS: %.1f MB\n", sum/1024 }'
# Total container memory from docker stats
docker stats --no-stream --format "{{.MemUsage}}" | \
awk -F/ '{ gsub(/[^0-9.]/, "", $1); sum += $1 } END { printf "Total: %.1f MiB\n", sum }'
# Count pods by status
kubectl get pods --no-headers | awk '{ status[$3]++ } END { for (s in status) print s, status[s] }'
# Output:
# Running 12
# CrashLoopBackOff 2
# Completed 5
Processing CLI Tool Output¶
# kubectl: extract restart counts for crashing pods
kubectl get pods --no-headers | awk '$4 > 3 { print $1, "restarts:", $4 }'
# docker: list images sorted by size
docker images --format "{{.Repository}}:{{.Tag}}\t{{.Size}}" | \
awk -F'\t' '{ print $2, $1 }' | sort -h
# aws: extract instance IDs and private IPs from text output
aws ec2 describe-instances --output text | \
awk '/^INSTANCES/ { id=$8; ip=$14 } /^TAGS.*Name/ { print $3, id, ip }'
# terraform output parsing
terraform output -json | jq -r 'to_entries[] | [.key, (.value.value | tostring)] | @tsv' | \
awk -F'\t' '{ printf "%-30s %s\n", $1, $2 }'
# Parse helm release info
helm list -A --output json | jq -r '.[] | [.name, .namespace, .status] | @tsv' | \
awk -F'\t' '{ printf "%-25s %-15s %s\n", $1, $2, $3 }'
# Parse ss/netstat output for listening ports
ss -tlnp | awk 'NR > 1 { split($4, a, ":"); print a[length(a)], $NF }' | sort -n
# Combine awk with sort for top-N analysis
journalctl --since "1 hour ago" --no-pager | \
awk 'tolower($0) ~ /error|fail/ { units[$5]++ }
END { for (u in units) print units[u], u }' | sort -rn | head -10
Quick Metrics with awk¶
# Count requests per second from access-log timestamps ($4, e.g. [15/Mar/2024:14:23:01)
awk '{ secs[substr($4, 2)]++ }
END { for (s in secs) print s, secs[s] }' access.log | sort | tail -5
# Percentile calculation (p50, p95, p99)
# First sort the data, then use awk
sort -n response_times.txt | awk '{
a[NR] = $1
}
END {
printf "p50: %.3f\n", a[int(NR*0.50)]
printf "p95: %.3f\n", a[int(NR*0.95)]
printf "p99: %.3f\n", a[int(NR*0.99)]
}'
# Histogram of response times (bucket by 100ms)
awk '{ bucket = int($1 * 10) / 10; hist[bucket]++ }
END { for (b in hist) { bar = ""; for (i = 0; i < hist[b]; i++) bar = bar "#"
      printf "%6.1f |%s %d\n", b, bar, hist[b] | "sort -n" } }' \
response_times.txt
# Bytes transferred per endpoint
awk '{ bytes[$7] += $10 }
END { for (ep in bytes) printf "%10d %s\n", bytes[ep], ep }' access.log | sort -rn | head -10
# Connection rate by minute (strip the leading "[" from $4, keep date through minutes)
awk '{ min = substr($4, 2, 17); counts[min]++ }
END { for (m in counts) print m, counts[m] }' access.log | sort | tail -10
Data Transformation and Reporting¶
# Pivot table: count by two dimensions
awk '{ key = $1 SUBSEP $2; count[key]++ }
END { for (k in count) {
split(k, parts, SUBSEP)
printf "%-15s %-15s %d\n", parts[1], parts[2], count[k]
}}' events.log
# Generate a markdown table from TSV
awk -F'\t' 'NR==1 { for (i=1;i<=NF;i++) printf "| %s ", $i; print "|"
for (i=1;i<=NF;i++) printf "|---"; print "|" }
NR>1 { for (i=1;i<=NF;i++) printf "| %s ", $i; print "|" }' data.tsv
# Extract the value of a "status" field from JSON lines (naive — quoted string values only)
awk -F'"' '/"status"/ { for (i=1;i<=NF;i++) if ($i=="status") print $(i+2) }' events.jsonl | sort | uniq -c
# Dedup by key, keeping last occurrence
awk -F'\t' '{ data[$1] = $0 } END { for (k in data) print data[k] }' events.tsv
Power One-Liners¶
Curated awk one-liners with detailed breakdowns. These teach core awk patterns you will use constantly.
Deduplicate lines without sorting¶
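A runnable sketch of the idiom the breakdown describes (sample data is piped in for demonstration; in practice you pass a file name):

```shell
# Print each distinct line only the first time it appears; input order preserved
printf 'alpha\nbeta\nalpha\ngamma\nbeta\n' | awk '!seen[$0]++'
# → alpha, beta, gamma (one per line)
```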
Breakdown: seen is an associative array keyed by the full line ($0). On first encounter the value is 0 (falsy after !), so the line prints. The ++ increments it, so subsequent identical lines are truthy after ! and suppressed. This preserves original order unlike sort -u.
[!TIP] When to use: Deduplicating log entries, config lists, or CSV exports where order matters.
Memory usage per user (aggregate from ps)¶
ps aux | awk '{mem[$1]+=$4} END {for (u in mem) printf "%-12s %.1f%%\n", u, mem[u]}' | sort -k2 -rn
Breakdown: $1 is username, $4 is %MEM. Accumulates memory per user into associative array mem. END block iterates all users, printf formats with padding. Piped to sort for descending order.
[!TIP] When to use: Quick triage during memory pressure — who's the hog?
Graph connections per host (ASCII histogram)¶
netstat -an | grep ESTABLISHED | awk '{print $5}' | awk -F: '{print $1}' | sort | uniq -c | awk '{printf "%-20s %5d ", $2, $1; for(i=0;i<$1;i++) printf "*"; print ""}'
Breakdown: Three-stage awk pipeline: (1) extract remote address field, (2) strip port with : as field separator, (3) format into aligned histogram where each * = one connection. The printf with %-20s left-aligns IPs.
[!TIP] When to use: Investigating connection storms, DDoS patterns, or connection pooling issues.
Display text between two patterns¶
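A minimal sketch of the range pattern described here (START/END are placeholder regexes; sample input is inlined):

```shell
# Print from the first line matching START through the next line matching END (inclusive)
printf 'skip\nSTART\nbody1\nbody2\nEND\nskip\n' | awk '/START/,/END/'
```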
Breakdown: AWK range pattern — when a line matches the first regex, awk starts printing. It continues until a line matches the second regex (inclusive). No variables, no state management needed.
[!TIP] When to use: Extracting config blocks, log windows between timestamps, or function bodies from source files.
Sum a column of numbers¶
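The accumulator pattern in its smallest form (sample numbers piped in for demonstration):

```shell
# Sum the first field of every line
printf '10\n20\n12\n' | awk '{ sum += $1 } END { print sum }'
# → 42
```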
Breakdown: Accumulator pattern — sum starts at 0 (awk default for uninitialized). Each line adds field 1. END fires once after all input is consumed. Change $1 to $N for any column.
[!TIP] When to use: Summing byte counts, request durations, error counts from log extracts.
Exclude specific columns¶
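A sketch of the field-nullifying idiom the breakdown describes (columns 2 and 4 are the example targets; sample data is made up):

```shell
# Blank out fields 2 and 4, then print the rebuilt record
printf 'user secret host token region\n' | awk '{ $2 = ""; $4 = ""; print }'
# → user  host  region   (note the doubled spaces left behind)
```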
Breakdown: Nullifies fields 2 and 4, then prints the whole record. OFS (output field separator) collapses them. Note: this changes spacing — pipe through column -t to re-align.
[!TIP] When to use: Stripping sensitive columns (passwords, tokens) from tabular output before sharing.
Analyze failed SSH logins from auth logs¶
sudo zcat /var/log/auth.log.*.gz | awk '/Failed password/ && !/invalid user/ {a[$9]++} /Failed password for invalid user/ {a["*"$11]++} END {for (i in a) printf "%6d\t%s\n", a[i], i}' | sort -rn | head -20
Breakdown: Two pattern-action pairs: valid usernames go into a[$9], invalid users (prefixed with *) into a[$11] — field positions differ because "invalid user" adds extra words. END block formats counts. sort -rn puts worst offenders first.
[!TIP] When to use: Post-incident forensics, brute-force detection, fail2ban tuning.
Apache/Nginx top talkers from access log¶
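The frequency-counter one-liner this breakdown refers to, sketched with inline sample lines (the IPs are made up; real usage reads the access log):

```shell
# Count hits per client IP ($1 in Common Log Format), highest first
printf '10.0.0.1 a\n10.0.0.2 b\n10.0.0.1 c\n' | \
  awk '{ ip[$1]++ } END { for (i in ip) print ip[i], i }' | sort -rn | head -10
```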
Breakdown: Classic frequency counter — $1 is the client IP in Common Log Format. Associative array ip counts hits per IP. Same pattern works for any field: $7 for URLs, $9 for status codes.
[!TIP] When to use: Identifying abusive IPs, traffic spikes, or bot activity.
Generate ASCII histogram from log timestamps¶
awk '{print substr($0,1,15)}' /var/log/syslog | uniq -c | sort -rn | head -20 | awk '{printf "%5d %s %s %s ", $1, $2, $3, $4; for(i=0;i<$1/10;i++) printf "#"; print ""}'
Breakdown: First awk extracts the timestamp prefix (first 15 chars covers Mon DD HH:MM:SS). uniq -c counts consecutive identical timestamps. Second awk draws bars scaled by /10 to keep width manageable.
[!TIP] When to use: Visualizing log bursts, correlating error spikes with events.
Find long lines in files (style/config auditing)¶
awk 'length > 120 {printf "%s:%d (%d chars): %s\n", FILENAME, NR, length, substr($0,1,80)"..."}' *.conf
Breakdown: length with no argument returns length of $0. FILENAME and NR are built-in — file name and line number. substr truncates the preview. Processes multiple files in one pass.
[!TIP] When to use: Enforcing line-length limits in config files, finding runaway log lines.
User/group relationship diagram (graphviz)¶
awk -F: 'BEGIN {print "digraph groups {"} {split($4,members,","); for(m in members) printf " \"%s\" -> \"%s\"\n", members[m], $1} END {print "}"}' /etc/group | dot -Tpng -o groups.png
Breakdown: BEGIN opens the digraph. Main body splits the member list (field 4, comma-delimited) and creates edges from each member to the group name (field 1). END closes the graph. Piped to graphviz dot.
[!TIP] When to use: Auditing group memberships, visualizing RBAC structures, onboarding documentation.
Transpose rows to columns¶
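A sketch of the accumulating-array transpose the breakdown describes; it assumes rectangular input (every row has the same NF), since the END loop uses the last row's NF:

```shell
# Accumulate the i-th field of every row into a[i], then print each as a row
printf '1 2 3\n4 5 6\n' | \
  awk '{ for (i = 1; i <= NF; i++) a[i] = (NR == 1 ? $i : a[i] OFS $i) }
       END { for (i = 1; i <= NF; i++) print a[i] }'
```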
Breakdown: Builds array a where each index i accumulates the i-th field from every row. After all input, prints each accumulated row (which is now a column). NF = number of fields.
[!TIP] When to use: Reformatting monitoring output, pivoting CSV data, preparing data for plotting.
Print line BEFORE a regex match¶
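A minimal sketch of the lookbehind idiom (ERROR is a placeholder regex; sample lines inlined):

```shell
# Keep the previous line in prev; print it when the current line matches
printf 'one\ntwo\nERROR boom\nfour\n' | awk '/ERROR/ { print prev } { prev = $0 }'
# → two
```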
Breakdown: Every line, prev stores the current line. When the next line matches, we print prev — which is the line before the match. This is the awk equivalent of grep -B1 but without needing GNU grep.
[!TIP] When to use: Finding what happened right before an error in a log, getting context before a pattern.
Print line AFTER a regex match¶
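A minimal sketch of the getline idiom (the label regex and sample lines are illustrative):

```shell
# On a match, consume the following line with getline and print it
printf 'Label:\nvalue42\nother\n' | awk '/Label:/ { getline; print }'
# → value42
```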
Breakdown: getline reads the next input line into $0, advancing NR. So when the regex matches, we immediately consume the next line and print it. Subtle: if the next line also matches, its successor won't be printed (getline already consumed it).
[!TIP] When to use: Extracting values that appear on the line after a label/header in structured output.
Reverse field order on each line¶
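The countdown loop the breakdown describes, sketched on a sample line:

```shell
# Walk fields from NF down to 1; the final print "" supplies the newline
printf 'a b c\n' | awk '{ for (i = NF; i >= 1; i--) printf "%s ", $i; print "" }'
```

Note the output keeps a trailing space after the last field; pipe through `sed 's/ $//'` if that matters downstream.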
Breakdown: Loops from the last field (NF) down to 1, printing each with a space. The final print "" adds the newline. This reverses column order without touching line order.
[!TIP] When to use: Swapping column order in logs, reversing CSV fields, reformatting data for different consumers.
Join every N lines into one (comma-separated)¶
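The one-assignment program the breakdown dissects, sketched for N=5 with sample input:

```shell
# ORS is "," except on every 5th line, where it becomes a newline
printf '1\n2\n3\n4\n5\n6\n7\n' | awk 'ORS = NR % 5 ? "," : "\n"'
```

If the input length isn't a multiple of 5, the final group ends with a trailing comma and no newline, as the last two lines show here.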
Breakdown: ORS (Output Record Separator) is dynamically set each line. If NR%5 is nonzero, ORS becomes , (joining to previous). Every 5th line, ORS becomes \n (ending the group). This is awk at its most terse — the entire program is one assignment that doubles as the condition-action.
[!TIP] When to use: Collapsing multi-line records into CSV rows, reformatting list output for spreadsheets, batching items.
Find the maximum value and its line¶
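A sketch of the running-max pattern; the NR==1 clause seeds max from the first line so all-negative data works too:

```shell
# Track the running max of field 1 and the full line that produced it
printf '3 foo\n9 bar\n5 baz\n' | \
  awk 'NR == 1 || $1 > max { max = $1; line = $0 } END { print line }'
# → 9 bar
```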
Breakdown: Compares field 1 of each line against running max. When a new max is found, stores both the value and the full line. END prints the winner. Change $1 to any field. Change > to < for minimum.
[!TIP] When to use: Finding the slowest request in a log, the largest file in du output, the highest-latency host.
Count all words in a file (wc -w alternative)¶
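The NF accumulator in its smallest form, with sample input inlined:

```shell
# NF is the per-line word count; accumulate it across all lines
printf 'one two\nthree\n' | awk '{ total += NF } END { print total }'
# → 3
```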
Breakdown: NF (Number of Fields) is the word count per line. Accumulate across all lines, print at end. Simpler and more composable than wc -w in pipelines.
[!TIP] When to use: Quick word counts, validating data density, checking if a file has content.
Count pattern occurrences¶
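A sketch of the counter with the n+0 guard the breakdown explains (ERROR is a placeholder pattern):

```shell
# n+0 prints 0 instead of an empty string when nothing matched
printf 'ok\nERROR a\nok\nERROR b\n' | awk '/ERROR/ { n++ } END { print n + 0 }'
# → 2
```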
Breakdown: Increments counter on each match. n+0 in the END block ensures 0 is printed (not blank) when there are no matches — because uninitialized awk variables are empty strings, and +0 forces numeric context.
[!TIP] When to use: Counting errors, warnings, or specific events in logs. The +0 trick is a classic awk idiom worth internalizing.
Squeeze whitespace (normalize fields)¶
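The idiom itself, sketched on a deliberately messy sample line:

```shell
# $1=$1 forces awk to rebuild $0 with OFS; the bare 1 triggers the default print
printf '  a\t b   c \n' | awk '{ $1 = $1 } 1'
# → a b c
```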
Breakdown: Assigning $1=$1 forces awk to reconstruct $0 using OFS (default: single space). This collapses all runs of whitespace (tabs, multiple spaces) into single spaces. The 1 is a true condition that triggers the default action (print). This is the most cryptic-looking yet useful awk idiom.
[!TIP] When to use: Normalizing messy output from ps, df, mount, or any command with variable-width columns before further processing.
Reverse all lines in a file (tac alternative)¶
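A sketch of the buffer-and-replay pattern; note it holds the whole file in memory, so it suits logs, not multi-gigabyte files:

```shell
# Buffer every line, then replay from NR down to 1
printf 'a\nb\nc\n' | awk '{ a[NR] = $0 } END { for (i = NR; i >= 1; i--) print a[i] }'
```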
Breakdown: Stores every line in array a indexed by line number. END block walks backward from NR to 1. This is tac in pure awk — works on systems where tac isn't installed (macOS, some BSDs).
[!TIP] When to use: Reading logs bottom-up (most recent first), reversing sort order, processing files in reverse.
Print lines between line numbers (inclusive)¶
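The line-number range pattern from the breakdown — `awk 'NR==8,NR==12' file` — demonstrated here on a smaller sample (lines 2 through 4):

```shell
# Range pattern on NR: activates at line 2, deactivates after line 4
printf '1\n2\n3\n4\n5\n' | awk 'NR == 2, NR == 4'
```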
Breakdown: Range pattern using line numbers instead of regexes. Activates at line 8, deactivates after line 12. Cleaner than sed -n '8,12p' because awk's range syntax is more readable and composable.
[!TIP] When to use: Extracting a known section from a file by line number, debugging specific log ranges.
Multi-pattern substitution (normalize variants)¶
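A sketch using the log-level normalization from the tip below; listing longer variants first in the alternation avoids a partial "warn" match eating "warning" on implementations that aren't strictly leftmost-longest:

```shell
# One gsub with alternation normalizes all variants in a single pass
printf 'warn: x\nWARN: y\nwarning: z\n' | \
  awk '{ gsub(/warning|warn|WARN/, "WARNING") } 1'
```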
Breakdown: gsub with alternation (|) replaces ALL variants in one pass. The 1 triggers print. Unlike sed which needs separate s/// commands for each pattern, awk handles alternation natively in one gsub.
[!TIP] When to use: Normalizing log levels (warn/warning/WARN to WARNING), standardizing status codes, cleaning up inconsistent data.
Join backslash-continued lines¶
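A sketch of the sub/getline/next combination the breakdown walks through (sample input inlined; note `\\` in printf yields one literal backslash):

```shell
# Strip the trailing backslash, pull in the next line with getline, print them joined
printf 'foo \\\nbar\nbaz\n' | \
  awk '/\\$/ { sub(/\\$/, ""); getline t; print $0 t; next } 1'
```

This handles one continuation per line; a continued line whose successor also ends in a backslash needs a while loop around the getline.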
Breakdown: If a line ends with \, strip the backslash, read the next line into variable t via getline, concatenate and print. next skips to the next input line. The trailing 1 prints non-continued lines normally.
[!TIP] When to use: Processing Makefiles, shell scripts, or config files that use \ continuation. Useful for feeding to tools that don't understand continuation.