Linux Text Processing - Street-Level Ops¶
Real-world text processing workflows for log analysis, data extraction, config comparison, and operational reporting.
Top N Items from Logs¶
The most common pattern in ops: "what are the most frequent X?"
# Top 20 IP addresses hitting the server
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
# 15234 10.0.1.55
# 8921 203.0.113.42
# 4562 198.51.100.7
# Top 10 HTTP status codes
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10
# 89234 200
# 5621 301
# 2341 404
# 891 500
# 234 503
# Top 10 requested URLs
awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10
# Top 10 user agents
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10
# Most common error messages in application logs
grep -i "error" /var/log/app/app.log | \
sed 's/^[0-9-]* [0-9:]*//' | \
sort | uniq -c | sort -rn | head -10
# Top 10 slowest endpoints (assuming log format includes response time)
awk '{print $NF, $7}' /var/log/nginx/access.log | \
sort -rn | head -10
# 4.523 /api/reports/generate
# 3.891 /api/export/csv
# 2.145 /api/search
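Beyond the single slowest requests, per-endpoint aggregates are usually more actionable. A sketch, assuming (as above) that the response time is the last field and the URL is field 7:

```shell
# Average and max response time per endpoint (assumes $NF = response time)
awk '{n[$7]++; sum[$7]+=$NF; if ($NF > max[$7]) max[$7]=$NF}
     END {for (u in n) printf "%s avg=%.3f max=%.3f n=%d\n", u, sum[u]/n[u], max[u], n[u]}' \
  /var/log/nginx/access.log | sort -t= -k2 -rn | head -10
```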
CSV Field Extraction and Analysis¶
Working with CSV data using standard tools.
# Sample CSV
cat sales.csv
# date,region,product,quantity,amount
# 2024-01-15,east,widget,100,2500.00
# 2024-01-15,west,gadget,50,3750.00
# 2024-01-16,east,widget,120,3000.00
# Extract specific columns (skip header)
tail -n +2 sales.csv | cut -d, -f2,5
# east,2500.00
# west,3750.00
# east,3000.00
# Total sales per region
tail -n +2 sales.csv | cut -d, -f2,5 | \
sort -t, -k1,1 | \
awk -F, '{sum[$1]+=$2} END {for (r in sum) printf "%s: $%.2f\n", r, sum[r]}' | \
sort -t'$' -k2 -rn
# east: $5500.00
# west: $3750.00
# Unique values in a column (field 3 = product)
tail -n +2 sales.csv | cut -d, -f3 | sort -u
# gadget
# widget
# Count records per date
tail -n +2 sales.csv | cut -d, -f1 | sort | uniq -c
# 2 2024-01-15
# 1 2024-01-16
# Filter rows where quantity > 100 and extract product + amount
tail -n +2 sales.csv | awk -F, '$4 > 100 {print $3, $5}'
# widget 3000.00
# Convert CSV to TSV (naive: breaks if quoted fields contain commas)
tr ',' '\t' < sales.csv > sales.tsv
# Convert TSV back to CSV
tr '\t' ',' < sales.tsv > sales.csv
# Add a header to headerless data
echo "name,age,city" | cat - data.csv > data_with_header.csv
# Extract a single column and format as comma-separated list
tail -n +2 sales.csv | cut -d, -f3 | sort -u | paste -sd,
# gadget,widget
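Quick summary statistics for a numeric column can be computed in a single pass; a sketch for the amount column (field 5):

```shell
# Min, max, and mean of the amount column (field 5)
tail -n +2 sales.csv | cut -d, -f5 | \
awk 'NR==1 || $1 < min {min=$1} $1 > max {max=$1} {sum+=$1}
     END {printf "min=%.2f max=%.2f mean=%.2f\n", min, max, sum/NR}'
```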
Comparing Config Files¶
Finding what changed between two versions of a configuration.
# Unified diff (most readable)
diff -u /etc/nginx/nginx.conf.bak /etc/nginx/nginx.conf
# --- /etc/nginx/nginx.conf.bak
# +++ /etc/nginx/nginx.conf
# @@ -15,7 +15,8 @@
# server {
# - listen 80;
# + listen 443 ssl;
# + ssl_certificate /etc/ssl/certs/app.pem;
# server_name example.com;
# Report which files differ without showing changes (non-recursive; for scripting)
diff -q /etc/old/ /etc/new/
# Files /etc/old/nginx.conf and /etc/new/nginx.conf differ
# Only in /etc/new/: ssl.conf
# Recursive comparison of directories
diff -rq /etc/nginx.bak/ /etc/nginx/
# Files /etc/nginx.bak/nginx.conf and /etc/nginx/nginx.conf differ
# Only in /etc/nginx/: conf.d/ssl.conf
# Side-by-side comparison
diff -y --width=120 old.conf new.conf | head -30
# Ignore comment lines and blank lines when comparing
diff <(grep -v '^#\|^$' old.conf) <(grep -v '^#\|^$' new.conf)
# Compare sorted lists (e.g., installed packages on two servers)
comm -3 <(ssh server1 'rpm -qa | sort') <(ssh server2 'rpm -qa | sort')
# Left column: only on server1
# Right column: only on server2
# Find config lines in file1 but not in file2
comm -23 <(sort file1.conf) <(sort file2.conf)
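comm can also report the intersection: suppress both "unique" columns to keep only lines present in both files.

```shell
# Config lines present in BOTH files (columns 1 and 2 suppressed)
comm -12 <(sort file1.conf) <(sort file2.conf)
```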
Deduplicating Lists¶
Gotcha:
uniq only removes adjacent duplicates. Running uniq on unsorted input silently passes through duplicates that are not next to each other. Always pipe through sort first, or use awk '!seen[$0]++' if you need to preserve original order.
# Remove duplicates (sorted output)
sort names.txt | uniq
# Remove duplicates preserving original order
awk '!seen[$0]++' names.txt
# Remove duplicates case-insensitively
sort -f names.txt | uniq -i
# Find and count duplicates
sort names.txt | uniq -d # show only duplicated lines
sort names.txt | uniq -dc # show duplicated lines with counts
# Deduplicate a CSV based on a specific column (keep first occurrence)
awk -F, '!seen[$1]++' data.csv
# Keeps the first row for each unique value in column 1
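To keep the *last* occurrence for each key instead, reverse the file, keep the first occurrence, and reverse again (tac is GNU coreutils):

```shell
# Deduplicate on column 1, keeping the LAST row for each value
tac data.csv | awk -F, '!seen[$1]++' | tac
```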
Analyzing Access Logs¶
Real operational patterns for nginx/Apache combined log format.
# Format: IP - - [date] "method url proto" status size "referer" "user-agent"
# Requests per hour
awk '{print $4}' access.log | cut -d: -f1,2 | sort | uniq -c | sort -rn
# 5432 [15/Jan/2024:14
# 4891 [15/Jan/2024:15
# 3201 [15/Jan/2024:13
# Count unique visitors (unique IPs)
awk '{print $1}' access.log | sort -u | wc -l
# 2847
# Bandwidth per IP (field 10 = bytes sent)
awk '{sum[$1]+=$10} END {for (ip in sum) print sum[ip], ip}' access.log | \
sort -rn | head -10
# 1534892345 10.0.1.55
# 892345123 203.0.113.42
# Find all 5xx errors with their URLs
awk '$9 ~ /^5/ {print $9, $7}' access.log | sort | uniq -c | sort -rn
# 234 500 /api/process
# 89 502 /api/export
# 45 503 /api/search
# Requests per minute during an incident window
awk '/15\/Jan\/2024:14:3[0-5]/ {print $4}' access.log | \
cut -d: -f1-3 | sort | uniq -c
# 892 [15/Jan/2024:14:30
# 1234 [15/Jan/2024:14:31
# 2345 [15/Jan/2024:14:32 <-- spike starts here
# 5678 [15/Jan/2024:14:33
# 4321 [15/Jan/2024:14:34
# Find POST requests with large responses (field 10 = response bytes sent)
awk '$6 == "\"POST" && $10 > 1000000 {print $1, $7, $10}' access.log
# Response time distribution (if response time is the last field)
awk '{print $NF}' access.log | sort -n | \
awk '{
all[NR] = $1
sum += $1
}
END {
printf "Count: %d\n", NR
printf "Mean: %.3f\n", sum/NR
printf "P50: %.3f\n", all[int(NR*0.50)]
printf "P95: %.3f\n", all[int(NR*0.95)]
printf "P99: %.3f\n", all[int(NR*0.99)]
printf "Max: %.3f\n", all[NR]
}'
# Count: 145892
# Mean: 0.234
# P50: 0.089
# P95: 0.892
# P99: 2.345
# Max: 12.456
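Percentiles can be complemented with a rough histogram; a sketch, bucketing response times into 0.1s bins (again assuming the response time is the last field):

```shell
# Response time histogram in 0.1s buckets
awk '{b=int($NF*10)/10; c[b]++} END {for (x in c) printf "%.1f %d\n", x, c[x]}' access.log | sort -n
```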
Finding Most Common Errors¶
# Extract unique error patterns (strip timestamps and variable data)
grep "ERROR" app.log | \
sed 's/^[0-9-]* [0-9:,]* //' | \
sed 's/[0-9a-f]\{8\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{12\}/UUID/g' | \
sort | uniq -c | sort -rn | head -10
# 1234 ERROR ConnectionError: database connection timeout
# 892 ERROR ValueError: invalid request payload for UUID
# 456 ERROR TimeoutError: upstream service did not respond
# Error count per hour (for trending)
grep "ERROR" app.log | \
awk '{print substr($0, 1, 13)}' | \
sort | uniq -c
# 45 2024-01-15 08
# 52 2024-01-15 09
# 234 2024-01-15 10 <-- spike
# 61 2024-01-15 11
# Errors by category
grep "ERROR" app.log | \
awk -F'ERROR ' '{print $2}' | \
cut -d: -f1 | \
sort | uniq -c | sort -rn
# 1234 ConnectionError
# 892 ValueError
# 456 TimeoutError
# 123 PermissionError
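The same category breakdown expressed as a percentage share reads better in reports; a sketch reusing the pipeline above:

```shell
# Error share per category (percent of all ERROR lines)
grep "ERROR" app.log | awk -F'ERROR ' '{print $2}' | cut -d: -f1 | \
sort | uniq -c | \
awk '{n[$2]=$1; total+=$1} END {for (c in n) printf "%5.1f%% %s\n", 100*n[c]/total, c}' | \
sort -rn
```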
Text-Based Reporting¶
Generate reports from raw data without external tools.
# System resource summary
echo "=== Disk Usage ==="
df -h | tail -n +2 | sort -k5 -rn | head -5 | column -t
echo ""
echo "=== Top Memory Consumers ==="
ps aux --sort=-%mem | head -6 | awk '{printf "%-10s %5s%% %5s%% %s\n", $1, $3, $4, $11}'
echo ""
echo "=== Top CPU Consumers ==="
ps aux --sort=-%cpu | head -6 | awk '{printf "%-10s %5s%% %5s%% %s\n", $1, $3, $4, $11}'
# Generate a formatted table from pipe-separated data
echo "Host|Status|Uptime|Load" > /tmp/report.txt
echo "web-01|UP|45d|0.23" >> /tmp/report.txt
echo "web-02|UP|12d|0.89" >> /tmp/report.txt
echo "db-01|DOWN|0d|N/A" >> /tmp/report.txt
column -t -s'|' /tmp/report.txt
# Host Status Uptime Load
# web-01 UP 45d 0.23
# web-02 UP 12d 0.89
# db-01 DOWN 0d N/A
# Count lines per log level
for level in DEBUG INFO WARN ERROR FATAL; do
  count=$(grep -c "$level" app.log 2>/dev/null)
  printf "%-8s %d\n" "$level" "${count:-0}"
done
# DEBUG 45231
# INFO 23456
# WARN 3421
# ERROR 1234
# FATAL 12
Data Pipeline: Multi-Step Processing¶
Complex real-world pipelines combining multiple tools.
# Pipeline 1: Find which services are generating the most errors
# Input: JSON-lines log format {"service": "auth", "level": "ERROR", ...}
grep '"level":"ERROR"' services.jsonl | \
grep -o '"service":"[^"]*"' | \
cut -d'"' -f4 | \
sort | uniq -c | sort -rn
# 1234 payment-service
# 567 auth-service
# 89 user-service
# Pipeline 2: Network connections by state
ss -tan | tail -n +2 | awk '{print $1}' | sort | uniq -c | sort -rn
# 234 ESTAB
# 56 TIME-WAIT
# 12 CLOSE-WAIT
# 3 LISTEN
# Pipeline 3: Find processes with the most open file descriptors
for pid in /proc/[0-9]*/fd; do
pid_num=$(echo $pid | cut -d/ -f3)
count=$(ls -1 $pid 2>/dev/null | wc -l)
name=$(cat /proc/$pid_num/comm 2>/dev/null)
echo "$count $pid_num $name"
done 2>/dev/null | sort -rn | head -10
# 1234 5678 nginx
# 456 2345 postgres
# 123 8901 java
# Pipeline 4: Summarize a directory tree by file type and size
find /var/log -type f -printf '%s %f\n' 2>/dev/null | \
awk '{
ext = $2
sub(/.*\./, "", ext)
if (ext == $2) ext = "no-ext"
count[ext]++
size[ext] += $1
}
END {
for (e in count) printf "%d files, %.1f MB: %s\n", count[e], size[e]/1048576, e
}' | sort -t, -k2 -rn
# 45 files, 234.5 MB: log
# 23 files, 12.3 MB: gz
# 12 files, 0.1 MB: no-ext
Converting Between Tabs and Spaces in Codebases¶
# Find files with tabs
grep -rl $'\t' --include='*.py' .
# ./old_module.py
# ./legacy/utils.py
# Preview the conversion
expand -t 4 old_module.py | diff -u old_module.py -
# Shows what would change
# Convert tabs to 4 spaces in all Python files
find . -name "*.py" -exec sh -c '
  if grep -q "$(printf "\t")" "$1"; then
    expand -t 4 "$1" > "$1.tmp" && mv "$1.tmp" "$1"
    echo "Converted: $1"
  fi
' _ {} \;
# Convert spaces to tabs (less common, for Makefiles)
unexpand --first-only -t 4 file.py
# Check for mixed tabs and spaces (first offending line per file; nextfile needs gawk/mawk)
awk '/\t/ && / / {print FILENAME": line "FNR": mixed tabs and spaces"; nextfile}' *.py
Quick Data Transformations¶
One-liners for common format conversions.
# Comma-separated to one-per-line
echo "alice,bob,carol,dave" | tr ',' '\n'
# alice
# bob
# carol
# dave
# One-per-line to comma-separated
cat names.txt | paste -sd,
# alice,bob,carol,dave
# Swap columns in a TSV (using awk since cut cannot reorder)
awk -F'\t' '{print $3 "\t" $1 "\t" $2}' data.tsv
# Remove blank lines
grep -v '^$' file.txt
# or: sed '/^$/d' file.txt
# Remove leading/trailing whitespace
sed 's/^[[:space:]]*//; s/[[:space:]]*$//' file.txt
# Number non-empty lines (blanks removed first so numbering is contiguous)
grep -v '^$' file.txt | nl -ba
# Generate a sequence and format it
seq -w 1 100 | paste - - - - -
# 001 002 003 004 005
# 006 007 008 009 010
# ...
# Join two files on a common field (like SQL JOIN)
# file1: id,name file2: id,salary
join -t, <(sort file1.csv) <(sort file2.csv)
# id,name,salary (matched on first field)
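join drops unmatched rows by default; GNU join can emulate a LEFT JOIN with -a and a fill value (same hypothetical file layout as above):

```shell
# LEFT JOIN: keep unmatched rows from file1, fill missing fields with NULL
join -t, -a1 -e 'NULL' -o auto <(sort file1.csv) <(sort file2.csv)
```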
# Extract email addresses from text
grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' document.txt | sort -u
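The same grep -o approach extracts other tokens; a loose IPv4 pattern (it does not validate octet range, so 999.1.1.1 also matches):

```shell
# Extract IPv4 addresses (loose: does not validate octet range)
grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' document.txt | sort -u
```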
Power One-Liners¶
Find duplicate files by content (not name)¶
find . -not -empty -type f -printf '%s\n' | sort -rn | uniq -d | xargs -I{} -n1 find . -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
Breakdown: Three-pass approach: (1) find all file sizes, (2) identify duplicate sizes (fast filter), (3) for each duplicate size, compute md5sum and group by hash. The size pre-filter avoids hashing every file.
[!TIP] When to use: Disk cleanup, finding redundant config copies, deduplicating backup artifacts.
Print specific line N from a file¶
awk 'NR==42 {print; exit}' file.txt # awk way (exits early)
sed -n '42p' file.txt # sed way
sed '42q;d' file.txt # sed way (quits early — faster on huge files)
[!TIP] When to use: Jumping to a specific line referenced in an error message or stack trace.
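The same idea extends to line ranges, still quitting as soon as the range is printed:

```shell
sed -n '40,50p;50q' file.txt                  # lines 40-50, quit after 50
awk 'NR>=40 && NR<=50; NR==50{exit}' file.txt # awk equivalent
```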
Full-width terminal separator line¶
printf '%*s' "${COLUMNS:-80}" '' | tr ' ' '-'
Breakdown: printf '%*s' prints N spaces where N comes from $COLUMNS (terminal width, defaulting to 80). tr replaces all spaces with the separator character.
[!TIP] When to use: Visual separation in script output, log formatting, dashboards.