Linux Text Processing — Trivia & History¶
Interesting facts, historical context, and edge cases about the classic Unix text toolkit.
1. sort predates Unix itself¶
The concept of a standalone sort utility appeared in the earliest Unix systems at Bell
Labs in 1971. But the sorting problem was already central to computing — IBM had dedicated
sorting hardware in the 1950s. Ken Thompson wrote the first Unix sort and it could already
handle files larger than memory by doing an external merge sort to disk. Modern GNU sort
still uses the same fundamental approach: split into sorted runs, merge them. The algorithm
has barely changed in 50 years because it did not need to.
2. cut was a latecomer¶
cut did not appear until Unix System III (1982), more than a decade after sort, uniq, and
most of the other text tools. Before cut existed, extracting a column meant writing an
awk '{print $N}' one-liner. Many greybeards still reflexively use awk for column extraction
because they learned Unix before cut existed. Both approaches work fine — awk is more
flexible, cut is faster for simple cases.
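Both idioms, side by side on toy data (the names and numbers are illustrative). Note the delimiter difference: awk splits on any run of whitespace, while cut defaults to a single tab.

```shell
# Extract the second column two ways.
printf 'alice 30\nbob 25\n' | awk '{print $2}'    # awk: splits on any whitespace
printf 'alice\t30\nbob\t25\n' | cut -f2           # cut: tab-delimited by default
# Both print:
# 30
# 25
```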
3. tr cannot take file arguments — by design¶
tr reads only from stdin. You cannot write tr 'a-z' 'A-Z' file.txt — there is no file
argument. This was a deliberate choice: tr is a pure filter that transforms a stream of
characters. It has no concept of files, lines, or fields. It sees raw bytes. This simplicity
is what makes it fast: tr does not need to buffer lines or parse delimiters.
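A minimal sketch (the file path is illustrative): to run tr over a file, let the shell open it with a redirection, or feed it through a pipe.

```shell
printf 'hello\nworld\n' > /tmp/notes.txt   # create a sample input file
tr 'a-z' 'A-Z' < /tmp/notes.txt            # redirection: the shell opens the file
cat /tmp/notes.txt | tr 'a-z' 'A-Z'        # pipe: equivalent, one extra process
# Both print:
# HELLO
# WORLD
```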
4. uniq was designed for sorted data (on purpose)¶
uniq only compares adjacent lines because the original Unix system could not afford to hold
an entire file in memory for deduplication. By requiring sorted input, uniq could work in
a single pass with O(1) memory — it only needed to remember the previous line. The sort-then-
uniq pattern was not a limitation; it was the design. Today, awk '!seen[$0]++' can
deduplicate without sorting by using a hash table, but it loads all unique values into memory.
For billion-line files, sort -u is still the only practical option.
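The trade-off in miniature (sample data is illustrative): the awk idiom preserves first-seen order but keeps every unique line in memory, while sort -u returns sorted output with bounded memory.

```shell
printf 'b\na\nb\nc\na\n' | awk '!seen[$0]++'   # prints b, a, c  (first-seen order)
printf 'b\na\nb\nc\na\n' | sort -u             # prints a, b, c  (sorted order)
```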
5. diff was invented for the legal profession¶
The original diff algorithm was published by James Hunt and Douglas McIlroy in 1976 at Bell Labs. But the
driving use case was not code review — it was comparing legal documents and contracts.
McIlroy's algorithm was based on the longest common subsequence (LCS) problem, which had
been studied in the context of molecular biology (comparing DNA sequences). The unified diff
format (diff -u) was added much later: it appeared around 1990 as an extension to GNU diff
and became the standard format for patches, code review, and eventually git diff.
6. The sort | uniq -c | sort -rn pipeline has a name¶
Practitioners sometimes call it the "frequency count" or "histogram" pipeline. It is
arguably the single most commonly typed pipeline in Unix operations. It answers the question
"what are the N most common values of X?" for any X that can be extracted as a line of text.
It works because:
- sort groups identical lines together
- uniq -c counts each group
- sort -rn orders by count, most frequent first
- head truncates to the top N
There is no built-in command that does this in one step. The four-command pipeline is it.
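A worked run on toy data (the HTTP method names are illustrative):

```shell
printf 'GET\nPOST\nGET\nGET\nPUT\nPOST\n' \
  | sort | uniq -c | sort -rn | head -n 2
# prints (the exact count padding varies by implementation):
#   3 GET
#   2 POST
```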
7. column -t cannot stream¶
The -t flag in column determines column widths by reading the entire input first, finding
the maximum width of each column, and then printing with appropriate padding. This means
column -t cannot stream — it must buffer all input before producing any output. For very
large inputs, this can use significant memory. For streaming tabular output, printf with
fixed widths is more appropriate.
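A streaming sketch using awk's printf, assuming you can pick the widths up front (the widths and sample data here are illustrative). Each line prints as soon as it is read:

```shell
# Fixed-width columns chosen ahead of time: no buffering needed.
printf 'alice 30\nbob 25\n' | awk '{printf "%-10s %5s\n", $1, $2}'
```

The cost is that a value longer than its chosen width breaks the alignment, which is exactly the problem column -t's full first pass avoids.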
8. paste was named for literal cut-and-paste¶
In the era of typewriters and typesetting, "cut and paste" meant physically cutting strips of
paper and pasting them side by side. The cut command extracts vertical strips (columns) and
paste joins them horizontally. They were designed as complementary tools: cut disassembles
and paste reassembles. The metaphor was literal.
9. comm output format has three columns separated by tabs¶
comm outputs three tab-indented columns. Lines unique to file 1 have no leading tab. Lines
unique to file 2 have one leading tab. Lines common to both have two leading tabs. This
format is surprisingly hard to parse programmatically. The -1, -2, -3 flags suppress
columns, which is almost always what you want. In practice, nobody reads the raw three-column
output — you suppress two of the three columns to get a useful result.
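In practice (file paths illustrative; both inputs must already be sorted):

```shell
printf 'a\nb\nc\n' > /tmp/one.txt
printf 'b\nc\nd\n' > /tmp/two.txt
comm -12 /tmp/one.txt /tmp/two.txt   # in both files:   b, c
comm -23 /tmp/one.txt /tmp/two.txt   # only in file 1:  a
comm -13 /tmp/one.txt /tmp/two.txt   # only in file 2:  d
```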
10. wc's "word" definition is unusual¶
wc -w counts "words" defined as sequences of non-whitespace characters separated by
whitespace. This means hello---world is one word (it contains no whitespace). A URL like
https://example.com/path?query=value is one word. An email address is one word. Code like
function(arg1,arg2) is one word. This definition is useless for natural language word
counting but perfectly useful for counting tokens in log files and command output.
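The examples above, checked directly (some wc implementations pad the count with leading spaces):

```shell
printf 'hello---world\n' | wc -w                          # 1
printf 'https://example.com/path?query=value\n' | wc -w   # 1
printf 'function(arg1,arg2)\n' | wc -w                    # 1
printf 'two tokens\n' | wc -w                             # 2
```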
11. head and tail have opposite defaults for -n¶
head -n 10 shows the first 10 lines. head -n -10 shows everything except the last 10.
tail -n 10 shows the last 10 lines. tail -n +10 shows everything from line 10 onward.
The sign conventions are different: head uses negative numbers to mean "all but the last N,"
while tail uses a plus sign to mean "starting from line N." These conventions evolved
independently and were never harmonized.
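Side by side on a 20-line file (note that head -n -N is a GNU extension, not POSIX):

```shell
seq 1 20 > /tmp/lines.txt
head -n 3  /tmp/lines.txt   # first 3 lines:       1 2 3
head -n -3 /tmp/lines.txt   # all but the last 3:  1 .. 17  (GNU extension)
tail -n 3  /tmp/lines.txt   # last 3 lines:        18 19 20
tail -n +3 /tmp/lines.txt   # from line 3 onward:  3 .. 20
```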
12. sort -R is not cryptographically random¶
sort -R (random sort / shuffle) uses a hash function, not a cryptographic random number
generator. It assigns a random hash to each line and sorts by the hash. This means:
- Identical lines always sort adjacent to each other (same hash)
- The randomness is seeded and reproducible with --random-source
- For true random shuffling of a file with duplicate lines, use shuf instead
shuf (GNU coreutils) uses the Fisher-Yates shuffle algorithm and does not group duplicates.
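The grouping effect is easy to demonstrate (both sort -R and shuf are GNU extensions): after sort -R, piping into uniq -c always yields exactly one group per distinct value, regardless of the random seed.

```shell
# 4 input lines, 2 distinct values: sort -R keeps the duplicates adjacent,
# so uniq -c always produces exactly 2 groups.
printf 'a\nb\na\nb\n' | sort -R | uniq -c
# shuf permutes freely: the two a's may or may not end up adjacent.
printf 'a\nb\na\nb\n' | shuf
```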
13. These tools handle files larger than RAM¶
A common misconception is that these tools load entire files into memory. Most do not:
- sort uses external merge sort (temp files on disk) for large inputs
- uniq uses O(1) memory (only remembers the previous line)
- cut, tr, head, tail, wc, nl, fold all stream line by line
- paste streams line by line across multiple files
- column -t is the notable exception — it must buffer everything to calculate widths
- diff needs both files in memory (or uses external temp files for very large inputs)
This is why sort bigfile.txt | uniq -c | sort -rn | head works on a 50GB log file when
your system has 4GB of RAM. The sort stages spill their sorted runs to temp files on disk
rather than buffering everything in RAM, and every other stage streams line by line.
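GNU sort exposes the spill behavior directly through -S (in-memory buffer size before a sorted run is written to disk) and -T (directory for the temp files); both are GNU extensions. A sketch that deliberately forces external merging (the sizes are illustrative):

```shell
# A 1 MB buffer on roughly 7 MB of input forces sort to spill runs to /tmp.
seq 1 1000000 | sort -rn -S 1M -T /tmp | head -n 1   # prints 1000000
```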
14. The POSIX standard guarantees most of these tools¶
Most of the tools in this topic pack are specified in the POSIX.1 standard: sort, uniq,
cut, tr, paste, comm, diff, wc, head, tail, nl, and fold are required to exist on any
POSIX-compliant system, whether Linux, macOS, FreeBSD, Solaris, AIX, or HP-UX. (column,
rev, tac, and shuf are widely available but are not part of POSIX.) The flags and
behaviors described here are the POSIX-portable subset unless noted as "GNU extension."
You can write pipelines using the POSIX tools and they will work on any Unix system you
encounter for the rest of your career. That is a property almost no modern CLI tool can
claim.
15. rev has exactly one common use case¶
rev reverses each line character by character. Its single most common use is extracting the
last field from a variable-delimiter string:
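For example, pulling the last component of a path (the sample path is illustrative):

```shell
# Last field = first field of the reversed line, reversed back.
echo '/usr/local/bin/python3' | rev | cut -d'/' -f1 | rev   # prints python3
```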
This works because cut cannot extract "the last field" — it can only extract by field
number, and you do not always know how many fields there are. Reversing, cutting field 1
(which is now the last), and reversing again is the classic workaround. Some people consider
this a hack; it has been in use since the 1980s and is perfectly idiomatic.
16. tac is cat spelled backwards — literally¶
tac reverses the order of lines in a file (last line first). The name is literally cat
spelled backwards, because it does the reverse operation. This naming convention — a tool
that undoes another tool being named as its reverse — is unique in Unix. There is no daeh
to undo head, no tros to undo sort. tac stands alone in this tradition.