Linux Text Processing - Street-Level Ops¶
Real-world text processing workflows for log analysis, data extraction, config comparison, and operational reporting.
Top N Items from Logs¶
The most common pattern in ops: "what are the most frequent X?"
# Top 20 IP addresses hitting the server
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
# 15234 10.0.1.55
# 8921 203.0.113.42
# 4562 198.51.100.7
# Top 10 HTTP status codes
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10
# 89234 200
# 5621 301
# 2341 404
# 891 500
# 234 503
# Top 10 requested URLs
awk '{print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10
# Top 10 user agents
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10
# Most common error messages in application logs
grep -i "error" /var/log/app/app.log | \
sed 's/^[0-9-]* [0-9:]*//' | \
sort | uniq -c | sort -rn | head -10
# Top 10 slowest endpoints (assuming log format includes response time)
awk '{print $NF, $7}' /var/log/nginx/access.log | \
sort -rn | head -10
# 4.523 /api/reports/generate
# 3.891 /api/export/csv
# 2.145 /api/search
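Beyond the single slowest requests, per-endpoint aggregates are usually more actionable. A sketch, assuming (as above) that the response time is the last field and the URL is field 7:

```shell
# Average and max response time per endpoint (assumes $NF = response time)
awk '{n[$7]++; sum[$7]+=$NF; if ($NF > max[$7]) max[$7]=$NF}
     END {for (u in n) printf "%s avg=%.3f max=%.3f n=%d\n", u, sum[u]/n[u], max[u], n[u]}' \
  /var/log/nginx/access.log | sort -t= -k2 -rn | head -10
```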
CSV Field Extraction and Analysis¶
Working with CSV data using standard tools.
# Sample CSV
cat sales.csv
# date,region,product,quantity,amount
# 2024-01-15,east,widget,100,2500.00
# 2024-01-15,west,gadget,50,3750.00
# 2024-01-16,east,widget,120,3000.00
# Extract specific columns (skip header)
tail -n +2 sales.csv | cut -d, -f2,5
# east,2500.00
# west,3750.00
# east,3000.00
# Total sales per region
tail -n +2 sales.csv | cut -d, -f2,5 | \
sort -t, -k1,1 | \
awk -F, '{sum[$1]+=$2} END {for (r in sum) printf "%s: $%.2f\n", r, sum[r]}' | \
sort -t'$' -k2 -rn
# east: $5500.00
# west: $3750.00
# Unique values in a column (field 3 = product)
tail -n +2 sales.csv | cut -d, -f3 | sort -u
# gadget
# widget
# Count records per date
tail -n +2 sales.csv | cut -d, -f1 | sort | uniq -c
# 2 2024-01-15
# 1 2024-01-16
# Filter rows where quantity > 100 and extract product + amount
tail -n +2 sales.csv | awk -F, '$4 > 100 {print $3, $5}'
# widget 3000.00
# Convert CSV to TSV (naive: breaks if quoted fields contain commas)
tr ',' '\t' < sales.csv > sales.tsv
# Convert TSV back to CSV
tr '\t' ',' < sales.tsv > sales.csv
# Add a header to headerless data
echo "name,age,city" | cat - data.csv > data_with_header.csv
# Extract a single column and format as comma-separated list
tail -n +2 sales.csv | cut -d, -f3 | sort -u | paste -sd,
# gadget,widget
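Quick summary statistics for a numeric column can be computed in a single pass; a sketch for the amount column (field 5):

```shell
# Min, max, and mean of the amount column (field 5)
tail -n +2 sales.csv | cut -d, -f5 | \
awk 'NR==1 || $1 < min {min=$1} $1 > max {max=$1} {sum+=$1}
     END {printf "min=%.2f max=%.2f mean=%.2f\n", min, max, sum/NR}'
```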
Comparing Config Files¶
Finding what changed between two versions of a configuration.
# Unified diff (most readable)
diff -u /etc/nginx/nginx.conf.bak /etc/nginx/nginx.conf
# --- /etc/nginx/nginx.conf.bak
# +++ /etc/nginx/nginx.conf
# @@ -15,7 +15,8 @@
# server {
# - listen 80;
# + listen 443 ssl;
# + ssl_certificate /etc/ssl/certs/app.pem;
# server_name example.com;
# Report which files differ without showing changes (non-recursive; for scripting)
diff -q /etc/old/ /etc/new/
# Files /etc/old/nginx.conf and /etc/new/nginx.conf differ
# Only in /etc/new/: ssl.conf
# Recursive comparison of directories
diff -rq /etc/nginx.bak/ /etc/nginx/
# Files /etc/nginx.bak/nginx.conf and /etc/nginx/nginx.conf differ
# Only in /etc/nginx/: conf.d/ssl.conf
# Side-by-side comparison
diff -y --width=120 old.conf new.conf | head -30
# Ignore comment lines and blank lines when comparing
diff <(grep -v '^#\|^$' old.conf) <(grep -v '^#\|^$' new.conf)
# Compare sorted lists (e.g., installed packages on two servers)
comm -3 <(ssh server1 'rpm -qa | sort') <(ssh server2 'rpm -qa | sort')
# Left column: only on server1
# Right column: only on server2
# Find config lines in file1 but not in file2
comm -23 <(sort file1.conf) <(sort file2.conf)
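comm can also report the intersection: suppress both "unique" columns to keep only lines present in both files.

```shell
# Config lines present in BOTH files (columns 1 and 2 suppressed)
comm -12 <(sort file1.conf) <(sort file2.conf)
```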
Deduplicating Lists¶
Gotcha:
uniq only removes adjacent duplicates. Running uniq on unsorted input silently passes through duplicates that are not next to each other. Always pipe through sort first, or use awk '!seen[$0]++' if you need to preserve original order.
# Remove duplicates (sorted output)
sort names.txt | uniq
# Remove duplicates preserving original order
awk '!seen[$0]++' names.txt
# Remove duplicates case-insensitively
sort -f names.txt | uniq -i
# Find and count duplicates
sort names.txt | uniq -d # show only duplicated lines
sort names.txt | uniq -dc # show duplicated lines with counts
# Deduplicate a CSV based on a specific column (keep first occurrence)
awk -F, '!seen[$1]++' data.csv
# Keeps the first row for each unique value in column 1
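To keep the *last* occurrence for each key instead, reverse the file, keep the first occurrence, and reverse again (tac is GNU coreutils):

```shell
# Deduplicate on column 1, keeping the LAST row for each value
tac data.csv | awk -F, '!seen[$1]++' | tac
```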
Analyzing Access Logs¶
Real operational patterns for nginx/Apache combined log format.
# Format: IP - - [date] "method url proto" status size "referer" "user-agent"
# Requests per hour
awk '{print $4}' access.log | cut -d: -f1,2 | sort | uniq -c | sort -rn
# 5432 [15/Jan/2024:14
# 4891 [15/Jan/2024:15
# 3201 [15/Jan/2024:13
# Count unique visitors (unique IPs)
awk '{print $1}' access.log | sort -u | wc -l
# 2847
# Bandwidth per IP (field 10 = bytes sent)
awk '{sum[$1]+=$10} END {for (ip in sum) print sum[ip], ip}' access.log | \
sort -rn | head -10
# 1534892345 10.0.1.55
# 892345123 203.0.113.42
# Find all 5xx errors with their URLs
awk '$9 ~ /^5/ {print $9, $7}' access.log | sort | uniq -c | sort -rn
# 234 500 /api/process
# 89 502 /api/export
# 45 503 /api/search
# Requests per minute during an incident window
awk '/15\/Jan\/2024:14:3[0-5]/ {print $4}' access.log | \
cut -d: -f1-3 | sort | uniq -c
# 892 [15/Jan/2024:14:30
# 1234 [15/Jan/2024:14:31
# 2345 [15/Jan/2024:14:32 <-- spike starts here
# 5678 [15/Jan/2024:14:33
# 4321 [15/Jan/2024:14:34
# Find POST requests with large responses (field 10 = response bytes sent)
awk '$6 == "\"POST" && $10 > 1000000 {print $1, $7, $10}' access.log
# Response time distribution (if response time is the last field)
awk '{print $NF}' access.log | sort -n | \
awk '{
all[NR] = $1
sum += $1
}
END {
printf "Count: %d\n", NR
printf "Mean: %.3f\n", sum/NR
printf "P50: %.3f\n", all[int(NR*0.50)]
printf "P95: %.3f\n", all[int(NR*0.95)]
printf "P99: %.3f\n", all[int(NR*0.99)]
printf "Max: %.3f\n", all[NR]
}'
# Count: 145892
# Mean: 0.234
# P50: 0.089
# P95: 0.892
# P99: 2.345
# Max: 12.456
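Percentiles can be complemented with a rough histogram; a sketch, bucketing response times into 0.1s bins (again assuming the response time is the last field):

```shell
# Response time histogram in 0.1s buckets
awk '{b=int($NF*10)/10; c[b]++} END {for (x in c) printf "%.1f %d\n", x, c[x]}' access.log | sort -n
```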
Finding Most Common Errors¶
# Extract unique error patterns (strip timestamps and variable data)
grep "ERROR" app.log | \
sed 's/^[0-9-]* [0-9:,]* //' | \
sed 's/[0-9a-f]\{8\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{4\}-[0-9a-f]\{12\}/UUID/g' | \
sort | uniq -c | sort -rn | head -10
# 1234 ERROR ConnectionError: database connection timeout
# 892 ERROR ValueError: invalid request payload for UUID
# 456 ERROR TimeoutError: upstream service did not respond
# Error count per hour (for trending)
grep "ERROR" app.log | \
awk '{print substr($0, 1, 13)}' | \
sort | uniq -c
# 45 2024-01-15 08
# 52 2024-01-15 09
# 234 2024-01-15 10 <-- spike
# 61 2024-01-15 11
# Errors by category
grep "ERROR" app.log | \
awk -F'ERROR ' '{print $2}' | \
cut -d: -f1 | \
sort | uniq -c | sort -rn
# 1234 ConnectionError
# 892 ValueError
# 456 TimeoutError
# 123 PermissionError
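The same category breakdown expressed as a percentage share reads better in reports; a sketch reusing the pipeline above:

```shell
# Error share per category (percent of all ERROR lines)
grep "ERROR" app.log | awk -F'ERROR ' '{print $2}' | cut -d: -f1 | \
sort | uniq -c | \
awk '{n[$2]=$1; total+=$1} END {for (c in n) printf "%5.1f%% %s\n", 100*n[c]/total, c}' | \
sort -rn
```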
Text-Based Reporting¶
Generate reports from raw data without external tools.
# System resource summary
echo "=== Disk Usage ==="
df -h | tail -n +2 | sort -k5 -rn | head -5 | column -t
echo ""
echo "=== Top Memory Consumers ==="
ps aux --sort=-%mem | head -6 | awk '{printf "%-10s %5s%% %5s%% %s\n", $1, $3, $4, $11}'
echo ""
echo "=== Top CPU Consumers ==="
ps aux --sort=-%cpu | head -6 | awk '{printf "%-10s %5s%% %5s%% %s\n", $1, $3, $4, $11}'
# Generate a formatted table from pipe-separated data
echo "Host|Status|Uptime|Load" > /tmp/report.txt
echo "web-01|UP|45d|0.23" >> /tmp/report.txt
echo "web-02|UP|12d|0.89" >> /tmp/report.txt
echo "db-01|DOWN|0d|N/A" >> /tmp/report.txt
column -t -s'|' /tmp/report.txt
# Host Status Uptime Load
# web-01 UP 45d 0.23
# web-02 UP 12d 0.89
# db-01 DOWN 0d N/A
# Count lines per log level
for level in DEBUG INFO WARN ERROR FATAL; do
  count=$(grep -c "$level" app.log 2>/dev/null)
  printf "%-8s %d\n" "$level" "${count:-0}"
done
# DEBUG 45231
# INFO 23456
# WARN 3421
# ERROR 1234
# FATAL 12
Data Pipeline: Multi-Step Processing¶
Complex real-world pipelines combining multiple tools.
# Pipeline 1: Find which services are generating the most errors
# Input: JSON-lines log format {"service": "auth", "level": "ERROR", ...}
grep '"level":"ERROR"' services.jsonl | \
grep -o '"service":"[^"]*"' | \
cut -d'"' -f4 | \
sort | uniq -c | sort -rn
# 1234 payment-service
# 567 auth-service
# 89 user-service
# Pipeline 2: Network connections by state
ss -tan | tail -n +2 | awk '{print $1}' | sort | uniq -c | sort -rn
# 234 ESTAB
# 56 TIME-WAIT
# 12 CLOSE-WAIT
# 3 LISTEN
# Pipeline 3: Find processes with the most open file descriptors
for pid in /proc/[0-9]*/fd; do
pid_num=$(echo $pid | cut -d/ -f3)
count=$(ls -1 $pid 2>/dev/null | wc -l)
name=$(cat /proc/$pid_num/comm 2>/dev/null)
echo "$count $pid_num $name"
done 2>/dev/null | sort -rn | head -10
# 1234 5678 nginx
# 456 2345 postgres
# 123 8901 java
# Pipeline 4: Summarize a directory tree by file type and size
find /var/log -type f -printf '%s %f\n' 2>/dev/null | \
awk '{
ext = $2
sub(/.*\./, "", ext)
if (ext == $2) ext = "no-ext"
count[ext]++
size[ext] += $1
}
END {
for (e in count) printf "%d files, %.1f MB: %s\n", count[e], size[e]/1048576, e
}' | sort -t, -k2 -rn
# 45 files, 234.5 MB: log
# 23 files, 12.3 MB: gz
# 12 files, 0.1 MB: no-ext
Converting Between Tabs and Spaces in Codebases¶
# Find files with tabs
grep -rl $'\t' --include='*.py' .
# ./old_module.py
# ./legacy/utils.py
# Preview the conversion
expand -t 4 old_module.py | diff -u old_module.py -
# Shows what would change
# Convert tabs to 4 spaces in all Python files
find . -name "*.py" -exec sh -c '
  if grep -q "$(printf "\t")" "$1"; then
    expand -t 4 "$1" > "$1.tmp" && mv "$1.tmp" "$1"
    echo "Converted: $1"
  fi
' _ {} \;
# Convert spaces to tabs (less common, for Makefiles)
unexpand --first-only -t 4 file.py
# Check for mixed tabs and spaces (first offending line per file; nextfile needs gawk/mawk)
awk '/\t/ && / / {print FILENAME": line "FNR": mixed tabs and spaces"; nextfile}' *.py
Quick Data Transformations¶
One-liners for common format conversions.
# Comma-separated to one-per-line
echo "alice,bob,carol,dave" | tr ',' '\n'
# alice
# bob
# carol
# dave
# One-per-line to comma-separated
cat names.txt | paste -sd,
# alice,bob,carol,dave
# Swap columns in a TSV (using awk since cut cannot reorder)
awk -F'\t' '{print $3 "\t" $1 "\t" $2}' data.tsv
# Remove blank lines
grep -v '^$' file.txt
# or: sed '/^$/d' file.txt
# Remove leading/trailing whitespace
sed 's/^[[:space:]]*//; s/[[:space:]]*$//' file.txt
# Number non-empty lines (blanks removed first so numbering is contiguous)
grep -v '^$' file.txt | nl -ba
# Generate a sequence and format it
seq -w 1 100 | paste - - - - -
# 001 002 003 004 005
# 006 007 008 009 010
# ...
# Join two files on a common field (like SQL JOIN)
# file1: id,name file2: id,salary
join -t, <(sort file1.csv) <(sort file2.csv)
# id,name,salary (matched on first field)
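join drops unmatched rows by default; GNU join can emulate a LEFT JOIN with -a and a fill value (same hypothetical file layout as above):

```shell
# LEFT JOIN: keep unmatched rows from file1, fill missing fields with NULL
join -t, -a1 -e 'NULL' -o auto <(sort file1.csv) <(sort file2.csv)
```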
# Extract email addresses from text
grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' document.txt | sort -u
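The same grep -o approach extracts other tokens; a loose IPv4 pattern (it does not validate octet range, so 999.1.1.1 also matches):

```shell
# Extract IPv4 addresses (loose: does not validate octet range)
grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' document.txt | sort -u
```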
Power One-Liners¶
Find duplicate files by content (not name)¶
find . -not -empty -type f -printf '%s\n' | sort -rn | uniq -d | xargs -I{} -n1 find . -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
Breakdown: Three-pass approach: (1) find all file sizes, (2) identify duplicate sizes (fast filter), (3) for each duplicate size, compute md5sum and group by hash. The size pre-filter avoids hashing every file.
[!TIP] When to use: Disk cleanup, finding redundant config copies, deduplicating backup artifacts.
Print specific line N from a file¶
awk 'NR==42 {print; exit}' file.txt # awk way (exits early)
sed -n '42p' file.txt # sed way
sed '42q;d' file.txt # sed way (quits early — faster on huge files)
[!TIP] When to use: Jumping to a specific line referenced in an error message or stack trace.
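The same idea extends to line ranges, still quitting as soon as the range is printed:

```shell
sed -n '40,50p;50q' file.txt                  # lines 40-50, quit after 50
awk 'NR>=40 && NR<=50; NR==50{exit}' file.txt # awk equivalent
```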
Full-width terminal separator line¶
printf '%*s' "${COLUMNS:-80}" '' | tr ' ' '-'
Breakdown: printf '%*s' prints N spaces where N comes from $COLUMNS (terminal width, defaulting to 80). tr replaces all spaces with the separator character.
[!TIP] When to use: Visual separation in script output, log formatting, dashboards.