Regex & Text Wrangling — Trivia & Interesting Facts¶
Surprising, historical, and little-known facts about regular expressions and text processing.
Regular expressions were invented by a mathematician in 1951¶
Stephen Cole Kleene, an American mathematician, defined regular expressions in 1951 as a notation for describing "regular sets" — a concept from automata theory. Ken Thompson implemented them in the QED editor in 1968, then in ed (the original Unix text editor, 1969), and finally in grep (1973). The theoretical foundation thus predates its first use in computing by nearly two decades.
grep stands for g/re/p — a command from the ed editor¶
grep literally comes from the ed editor command g/re/p, meaning "globally search for a regular expression and print matching lines." Ken Thompson wrote the standalone grep tool in 1973 after finding himself repeatedly using this ed command. The name is so embedded in computing culture that "to grep" has become a verb meaning "to search."
There are at least 7 different regex dialects in common use¶
POSIX BRE, POSIX ERE, Perl-Compatible (PCRE), JavaScript, Python re, Java, and .NET regex all have subtly different syntax. \d means "digit" in PCRE but is not valid in POSIX. {n,m} requires backslash escaping in BRE (\{n,m\}) but not in ERE. The metacharacter + does not exist in BRE. These differences cause endless porting bugs.
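As a quick illustration, Python's re module follows the Perl-style conventions described above; the comments note how the same ideas are spelled in POSIX BRE:

```python
import re

# Python's re module uses the Perl-ish dialect: \d is a digit class
# and {n,m} needs no backslashes.
print(re.search(r"\d{2,3}", "id=425").group())   # -> "425"
print(re.fullmatch(r"a+", "aaa") is not None)    # -> True; + is a quantifier here

# In POSIX BRE (plain grep), the same digit search would be spelled
# [0-9]\{2,3\}, and there is no + metacharacter at all.
```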
sed was written in 1974 and its syntax has not changed since¶
sed (stream editor) was written by Lee McMahon at Bell Labs in 1974. Its s/pattern/replacement/g syntax is perhaps the most widely recognized regex idiom in computing. Despite being 50 years old, sed is installed on every Unix/Linux system and used in millions of scripts. The GNU version added -i (in-place editing) — now perhaps its most-used flag, and one that was absent from the original.
awk is named after three people¶
awk was created by Alfred Aho, Peter Weinberger, and Brian Kernighan at Bell Labs in 1977. The name is their initials: A-W-K. awk is a complete programming language with variables, arrays, functions, and printf — far more than a text filter. Many programs that should be awk scripts are written as Python scripts simply because awk is perceived as obscure.
Catastrophic backtracking can freeze your entire system¶
Backtracking regex engines (used in PCRE, Python, and Java) can take exponential time on certain pattern-input combinations. The classic example: (a+)+$ matching against "aaaaaaaaaaaaaaaaX" backtracks roughly 2^n times for n a's. This has caused real outages: Cloudflare's 2019 global outage was triggered by a single catastrophically backtracking regex in a WAF rule. Google's RE2 and Rust's regex crate avoid the problem entirely by simulating a Thompson NFA, which guarantees linear-time matching.
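The blowup is easy to reproduce with Python's backtracking re engine; the exact timings below are machine-dependent, but the doubling trend is not:

```python
import re
import time

pat = re.compile(r"(a+)+$")

def failed_match_time(n: int) -> float:
    text = "a" * n + "X"              # trailing X guarantees the match fails
    start = time.perf_counter()
    assert pat.search(text) is None   # engine tries ~2^(n-1) partitions first
    return time.perf_counter() - start

# Each extra 'a' roughly doubles the running time.
for n in (14, 17, 20):
    print(n, f"{failed_match_time(n):.4f}s")
```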
Perl Compatible Regular Expressions are not actually compatible with Perl¶
PCRE, written by Philip Hazel in 1997, aims to provide "Perl-compatible" regex but has diverged from actual Perl regex in multiple ways over the years. Perl has added features (like \R for any line ending) that PCRE adopted later, and PCRE has features Perl lacks. Despite the name, PCRE is its own dialect used by PHP, Apache, Nginx, and many C/C++ programs.
The tr command is the simplest and most underused text tool¶
tr (translate) performs single-character substitution and deletion. tr 'a-z' 'A-Z' converts to uppercase. tr -d '\r' removes carriage returns. tr -s ' ' squeezes repeated spaces. It was written by Ken Thompson and is faster than sed for character-level operations because it uses a simple lookup table instead of a regex engine.
cut is far simpler than awk but does one thing better¶
cut extracts columns from text using delimiter-based or character-position-based splitting. For simple column extraction (like getting field 3 from a CSV), cut -d',' -f3 is dramatically simpler and faster than awk -F',' '{print $3}'. cut's limitation — it cannot reorder fields or handle multi-character delimiters — is exactly where awk takes over.
Look-ahead and look-behind are zero-width — they match a position, not text¶
Regex look-ahead (?=...) and look-behind (?<=...) assert that a pattern exists before or after the current position without consuming characters. This is conceptually subtle: they answer "is this pattern present?" without including it in the match. This enables extracting the number after "Price:" without including "Price:" in the result: (?<=Price:)\s*\d+.
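A minimal Python sketch of that extraction (the input line is an invented example):

```python
import re

line = "Invoice Price: 42 USD"
# (?<=Price:) asserts that "Price:" immediately precedes the current
# position, but contributes zero characters to the match itself.
m = re.search(r"(?<=Price:)\s*\d+", line)
print(m.group().strip())  # -> "42"
```

Note that Python requires look-behind patterns to be fixed-width, which "Price:" satisfies.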
The world's most-used regex is probably for email validation¶
Email validation regex ranges from the simple (\S+@\S+) to the infamous RFC 5322 compliant version, which is over 6,000 characters long. Per the RFC, email addresses can contain quoted strings, comments, and IP-address literals, making perfect regex validation nearly impossible. Most production systems use a simple check and then send a verification email.
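A hedged sketch of that pragmatic approach; the pattern below is a deliberately loose plausibility check, not RFC 5322 validation:

```python
import re

# Deliberately minimal: something@something.something. Exotic addresses
# the RFC allows (quoted strings, comments, IP literals) will fail this
# check; that is the accepted trade-off.
EMAIL_RE = re.compile(r"^\S+@\S+\.\S+$")

def looks_like_email(addr: str) -> bool:
    """Cheap plausibility check; actual deliverability is proven by
    sending a verification email, not by any regex."""
    return EMAIL_RE.match(addr) is not None

print(looks_like_email("alice@example.com"))   # True
print(looks_like_email("no-at-sign.example"))  # False
```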
sort and uniq are separate commands for a reason — but everyone pipes them together¶
sort | uniq is so common that sort -u exists as a shortcut. They are separate commands because of the Unix philosophy (each tool does one thing), but also because uniq only removes adjacent duplicates — it requires sorted input to remove all duplicates. This design choice confuses newcomers who expect uniq alone to work like Python's set().
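The adjacent-only behavior of uniq can be sketched in Python with itertools.groupby, which likewise collapses only neighboring duplicates:

```python
from itertools import groupby

lines = ["b", "a", "a", "b"]

# uniq-style: collapse only adjacent duplicates -- the second "b" survives
print([k for k, _ in groupby(lines)])          # ['b', 'a', 'b']

# sort | uniq style: sorting first makes ALL duplicates adjacent
print([k for k, _ in groupby(sorted(lines))])  # ['a', 'b']
```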