Linux Text Processing Footguns
Mistakes with text processing tools that silently produce wrong results, corrupt data, or waste hours of debugging.
1. uniq without sort first (only removes adjacent duplicates)
You pipe a log file through uniq to remove duplicates. The output still has duplicates everywhere. You check the file — the duplicate lines are there but not adjacent. uniq only compares consecutive lines.
# Input:
echo -e "apple\nbanana\napple\ncherry\nbanana" | uniq
# apple
# banana
# apple <-- still here!
# cherry
# banana <-- still here!
# Fix: always sort first
echo -e "apple\nbanana\napple\ncherry\nbanana" | sort | uniq
# apple
# banana
# cherry
# If you need to preserve original order, use awk instead
echo -e "apple\nbanana\napple\ncherry\nbanana" | awk '!seen[$0]++'
# apple
# banana
# cherry
This is the single most common text processing mistake. Every time you type uniq, ask yourself: "is this input sorted?"
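Since sort | uniq is such a common pair, sort has a -u flag that does both in one step. Note that uniq's other modes, like -c for counting, still need sorted input:

```shell
# sort -u: sort and de-duplicate in a single pass
echo -e "apple\nbanana\napple\ncherry\nbanana" | sort -u
# apple
# banana
# cherry
```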
2. cut cannot reorder fields
You have a CSV and want to output fields in a different order: field 3 first, then field 1.
echo "a,b,c,d" | cut -d, -f3,1
# a,c
# NOT c,a — cut always outputs in original order regardless of -f argument order
# Fix: use awk for field reordering
echo "a,b,c,d" | awk -F, '{print $3","$1}'
# c,a
cut -f3,1 and cut -f1,3 produce identical output. cut respects the original field order, always.
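This is easy to verify: the two field lists below differ only in order, and the output is identical:

```shell
echo "a,b,c,d" | cut -d, -f3,1
# a,c
echo "a,b,c,d" | cut -d, -f1,3
# a,c  (same output: cut keeps the original field order)
```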
3. sort locale affecting order (LC_ALL=C for byte sort)
You sort a list of strings and get unexpected results. Uppercase and lowercase are interleaved. Numbers sort wrong. Accented characters appear in strange places.
# With default locale (en_US.UTF-8):
echo -e "apple\nBanana\ncherry" | sort
# apple
# Banana
# cherry
# (locale-aware collation: roughly case-insensitive dictionary order)
# With C locale (byte order):
echo -e "apple\nBanana\ncherry" | LC_ALL=C sort
# Banana
# apple
# cherry
# (byte order: every uppercase letter sorts before every lowercase letter)
# Numbers: plain sort is lexicographic in any locale
echo -e "10\n9\n100\n2" | sort
# 10 <-- "10" sorts before "2": compared character by character
# 100
# 2
# 9
# Fix: explicit numeric sort
echo -e "10\n9\n100\n2" | sort -n
# 2
# 9
# 10
# 100
Rule: When you need deterministic, reproducible sorting (especially in scripts), use LC_ALL=C sort. The default locale-based sort is designed for human-readable output but can produce surprising results in pipelines.
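To see the rule in action, and to get case-insensitive ordering without giving up determinism, LC_ALL=C can be combined with -f (fold case); a quick sketch:

```shell
# Deterministic byte order: A-Z (65-90) before a-z (97-122)
printf 'b\nA\na\nB\n' | LC_ALL=C sort
# A
# B
# a
# b
# Deterministic AND case-insensitive: add -f (fold lowercase to uppercase)
printf 'b\nA\na\nB\n' | LC_ALL=C sort -f
# A
# a
# B
# b
```

Ties under -f (A vs a) are broken by sort's last-resort whole-line byte comparison, so the output is still fully reproducible.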
4. tr does not support strings (it is character-by-character)
You try to replace a word using tr.
# You want to replace "cat" with "dog"
echo "the cat sat" | tr 'cat' 'dog'
# ghe dog sog
# WRONG! tr built a character map: c->d, a->o, t->g, so every 't' became 'g'
# Fix: use sed for string replacement
echo "the cat sat" | sed 's/cat/dog/g'
# the dog sat
tr 'abc' 'xyz' means: replace every 'a' with 'x', every 'b' with 'y', every 'c' with 'z'. It is a character translation table, not a string substitution.
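What tr is actually for is per-character work: translating ranges, deleting characters, and squeezing repeats:

```shell
echo "hello world" | tr 'a-z' 'A-Z'   # HELLO WORLD (range translation)
echo "hello world" | tr -d 'lo'       # he wrd      (delete characters)
echo "aaabbbccc"   | tr -s 'abc'      # abc         (squeeze repeats)
```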
5. diff exit codes (1 = differences found, not error)
You use diff in a script with set -e (exit on error). The script exits unexpectedly when comparing two files that are different.
#!/bin/bash
set -e
# This kills the script if files differ (exit code 1)
diff old.conf new.conf > changes.txt
echo "This line never runs if files differ"
# diff exit codes:
# 0 = files are identical
# 1 = files differ (NOT an error)
# 2 = actual error (file not found, etc.)
# Fix: handle the exit code explicitly
diff old.conf new.conf > changes.txt || true
# or:
if diff -q old.conf new.conf > /dev/null 2>&1; then
    echo "Files are identical"
else
    echo "Files differ"
fi
6. tail -f vs -F (log rotation breaks -f)
You monitor a log file with tail -f. Logrotate runs and renames the file. tail -f continues following the old (now renamed) file descriptor. New log entries go to the new file, but your terminal shows nothing.
# What happens:
# 1. tail -f /var/log/app.log (follows inode 12345)
# 2. logrotate renames app.log to app.log.1 (inode 12345 is now app.log.1)
# 3. logrotate creates new app.log (inode 67890)
# 4. tail -f is still following inode 12345 (the old file)
# 5. New log entries go to inode 67890 — you see nothing
# Fix: use -F (capital F)
tail -F /var/log/app.log
# -F = --follow=name --retry
# Follows the file BY NAME, not by inode
# When the file is replaced, tail reopens it
Always use tail -F for log monitoring in production. The only reason to use tail -f is when you are following a file that will never be rotated (e.g., a named pipe).
7. wc -l counts newlines, not lines (missing final newline = wrong count)
A file has 3 lines of text but no trailing newline after the last line. wc -l reports 2.
printf "line1\nline2\nline3" > no_trailing_newline.txt
wc -l no_trailing_newline.txt
# 2 no_trailing_newline.txt <-- reports 2, not 3!
printf "line1\nline2\nline3\n" > with_trailing_newline.txt
wc -l with_trailing_newline.txt
# 3 with_trailing_newline.txt <-- correct
# Fix: if this matters, use a tool that counts the unterminated final line
grep -c '' no_trailing_newline.txt
# 3 (the empty pattern matches every line, including the incomplete one)
POSIX defines a "line" as text terminated by a newline. A file without a trailing newline has an incomplete final line that wc -l does not count. Most text editors add trailing newlines, but programmatically generated files often do not.
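A quick way to check whether a non-empty file ends in a newline is to read its last byte. Command substitution strips a trailing newline, so the result is empty exactly when the file is properly terminated:

```shell
# Non-empty result => the last byte is not a newline
if [ -n "$(tail -c 1 no_trailing_newline.txt)" ]; then
  echo "missing trailing newline"
fi
```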
8. sort -n vs -h vs -g (numeric types are not interchangeable)
# -n: numeric sort of a leading number (digits plus optional decimal point;
#     parsing stops at the first other character)
echo -e "1.5\n2\n10\n1.1" | sort -n
# 1.1
# 1.5
# 2
# 10
# Works for simple decimals
# -g: general numeric (handles scientific notation)
echo -e "1e3\n2.5e2\n1e2" | sort -g
# 1e2 (100)
# 2.5e2 (250)
# 1e3 (1000)
# -n gets this wrong:
echo -e "1e3\n2.5e2\n1e2" | sort -n
# 1e2
# 1e3 <-- wrong! 1 < 2.5 numerically, but sort -n stops at 'e'
# 2.5e2
# -h: human-readable (K, M, G, T suffixes)
echo -e "1G\n500M\n2G\n100K" | sort -h
# 100K
# 500M
# 1G
# 2G
# -n gets this completely wrong:
echo -e "1G\n500M\n2G\n100K" | sort -n
# 1G
# 2G
# 100K
# 500M <-- sorted 1, 2, 100, 500 (ignoring suffixes)
Use -n for integers and simple decimals. Use -g for scientific notation. Use -h for human-readable sizes (output of du -h, ls -lh). Do not mix them.
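In practice -h exists to pair with tools that emit human-readable sizes; a typical use, assuming a directory with mixed-size contents:

```shell
# sort -h understands the K/M/G suffixes that du -h emits
du -sh ./* | sort -h     # smallest first
du -sh ./* | sort -hr    # largest first, handy for finding disk hogs
```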
9. cut with multi-character delimiter (it cannot)
You have a file delimited by :: or || and try to use cut.
echo "field1::field2::field3" | cut -d'::' -f2
# cut: the delimiter must be a single character
# Try '--help' for more information.
# Fix: use awk
echo "field1::field2::field3" | awk -F'::' '{print $2}'
# field2
# Or replace the delimiter first
echo "field1::field2::field3" | sed 's/::/\t/g' | cut -f2
# field2
cut -d accepts exactly one character. For multi-character delimiters, use awk -F.
10. comm requiring sorted input (garbage output otherwise)
comm compares two files line by line and requires both to be sorted. If either file is unsorted the output is wrong: GNU comm prints an out-of-order warning on stderr, but other implementations say nothing, and the garbage lands on stdout either way.
# Unsorted files
echo -e "banana\napple\ncherry" > file1.txt
echo -e "apple\ndate\nbanana" > file2.txt
# comm produces interleaved garbage (GNU comm at least warns on stderr)
comm file1.txt file2.txt
#         apple    <-- column 2 ("only in file2")
# banana           <-- column 1 ("only in file1")
# apple            <-- apple again, now in column 1!
# cherry
#         date
#         banana   <-- banana appears in both columns
# No column gives a trustworthy answer anywhere
# Fix: always sort inputs
comm <(sort file1.txt) <(sort file2.txt)
#                 apple    <-- column 3: in both files
#                 banana   <-- column 3: in both files
# cherry                   <-- column 1: only in file1
#         date             <-- column 2: only in file2
Use process substitution <(sort file) to sort inline. Never trust comm output from unsorted files — it will not warn you.
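Once both inputs are sorted, comm's column-suppression flags (-1, -2, -3) turn it into a set-operation tool; reusing file1.txt and file2.txt from above:

```shell
# Intersection: suppress columns 1 and 2, leaving "lines in both"
comm -12 <(sort file1.txt) <(sort file2.txt)
# apple
# banana
# Difference: suppress columns 2 and 3, leaving "only in file1"
comm -23 <(sort file1.txt) <(sort file2.txt)
# cherry
```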
11. head -n 0 behavior varies
# GNU head: -n 0 outputs nothing (expected)
seq 10 | head -n 0
# (no output)
# But GNU head treats -n -NUM as "all but the last NUM lines"
seq 10 | head -n -0
# 1 through 10 (all lines!)
# Because "all except the last 0 lines" = every line
# Safer: always check for the edge case in scripts
COUNT=${1:-10}
if [ "$COUNT" -gt 0 ]; then
head -n "$COUNT" file.txt
fi
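On GNU head the negative count is a deliberate extension rather than a quirk: -n -NUM means "all but the last NUM lines":

```shell
# GNU extension: drop the last 2 lines
seq 10 | head -n -2
# prints 1 through 8
# BSD/macOS head rejects negative counts with an error,
# so do not rely on this in portable scripts
```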
12. sort -k without field end (sorting more than you think)
# Data: name,department,salary
echo -e "alice,eng,100000\nbob,eng,95000\ncarol,sales,100000" > employees.csv
# Sort by salary (field 3) — WRONG twice over
sort -t, -k3 employees.csv
# alice,eng,100000
# carol,sales,100000
# bob,eng,95000  <-- 95000 sorted after 100000: lexicographic, needed -n
# Also, -k3 means "field 3 through end of line". Nothing follows field 3
# here, but with more trailing fields sort would compare those too.
# Numeric fix: sort -t, -k3,3n employees.csv
# WRONG: sort by department
sort -t, -k2 employees.csv
# This sorts by field 2 THROUGH END OF LINE
# So "eng,100000" is compared as a whole string against "eng,95000"
# RIGHT: sort by department only
sort -t, -k2,2 employees.csv
# This sorts by field 2 and only field 2
# The difference matters when you have ties in the sort key
# -k2 breaks ties using everything after field 2
# -k2,2 limits the key to field 2; pair it with -s (stable) to keep input
# order for ties, since sort otherwise falls back to whole-line comparison
Always specify both start and end of the key: -k2,2 not -k2. This is the second most common text processing mistake after forgetting to sort before uniq.
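Bounded keys are also what make multi-key sorts possible: each -k gets its own range and its own modifiers. Reusing employees.csv from above:

```shell
# Department ascending, then salary descending (numeric) within department
sort -t, -k2,2 -k3,3nr employees.csv
# alice,eng,100000
# bob,eng,95000
# carol,sales,100000
```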