What Happens When You Type a Regex

lesson
regex-internals
nfa/dfa
backtracking
catastrophic-regex
grep-vs-pcre
performance ---# What Happens When You Type a Regex

Topics: regex internals, NFA/DFA, backtracking, catastrophic regex, grep vs PCRE, performance Level: L1–L2 (Foundations → Operations) Time: 45–60 minutes Prerequisites: Basic regex usage (matching patterns)

The Mission¶

You write a regex to validate email addresses. It works in tests. In production, a single malformed input causes your regex to run for 30 seconds, consuming 100% CPU. Your service is effectively DoS'd by a string.

This is catastrophic backtracking — a regex that takes exponential time on certain inputs. It's a real vulnerability (ReDoS) and it's in more production code than anyone wants to admit. Understanding how regex engines actually work prevents this.

Two Types of Regex Engine¶

Every regex engine is one of two types:

DFA (Deterministic Finite Automaton)¶

Used by: grep, awk, RE2 (Google)

The engine builds a state machine from the pattern and processes the input string character by character, making exactly one state transition per character. Time is always linear in the length of the input — O(n).

NFA (Nondeterministic Finite Automaton) with backtracking¶

Used by: Python re, JavaScript, Java, Ruby, Perl, grep -P, PCRE

The engine tries to match the pattern by exploring possibilities. When a choice point exists (like .* followed by more pattern), it tries one path. If that path fails, it backtracks and tries the other. On certain patterns and inputs, this backtracking can be exponential.

Pattern: (a+)+$
Input:   aaaaaaaaaaaaaaaaX

NFA tries: aaaaaaaaaaaaaaaa → X doesn't match $
Backtrack: aaaaaaaaaaaaaaa + a → X doesn't match
Backtrack: aaaaaaaaaaaaaa + aa → X doesn't match
Backtrack: aaaaaaaaaaaaaa + a + a → X doesn't match
... 2^16 combinations to exhaust!

Mental Model: DFA is like driving on a one-way road system — you can only go forward, one turn per intersection, and you always reach the destination (or a dead end) in predictable time. NFA with backtracking is like exploring a maze — when you hit a wall, you go back and try another path. Most mazes are simple. But some mazes have exponentially many dead ends.

Catastrophic Backtracking: The ReDoS Vulnerability¶

ReDoS (Regular Expression Denial of Service) exploits patterns where NFA backtracking is exponential. The attacker sends a crafted input that makes your regex engine consume all CPU.

Dangerous patterns (all contain nested quantifiers):

(a+)+           # Nested quantifiers
(a|aa)+         # Overlapping alternatives
(.*a){10}       # Quantified group with greedy match
([a-zA-Z]+)*    # Quantified group matching the same characters

Safe versions:

a+              # No nesting
(a|b)+          # Non-overlapping alternatives
.{0,100}a       # Bounded quantifier instead of .*
[a-zA-Z]+       # No unnecessary grouping with quantifier

Testing for catastrophic backtracking¶

import re
import time

pattern = re.compile(r'(a+)+$')

for length in [10, 15, 20, 25]:
    test_input = 'a' * length + 'X'
    start = time.time()
    pattern.match(test_input)
    elapsed = time.time() - start
    print(f"Length {length}: {elapsed:.3f}s")

# → Length 10: 0.001s
# → Length 15: 0.035s
# → Length 20: 1.100s  ← exponential growth!
# → Length 25: 35.000s ← production DoS

grep vs PCRE: The Dialect Divide¶

Feature	`grep -E` (ERE/POSIX)	`grep -P` (PCRE)	Python `re`
Engine	DFA (usually)	NFA (backtracking)	NFA (backtracking)
`\d` (digit)	❌ Use `[0-9]`	✅	✅
`\b` (word boundary)	❌	✅	✅
Lookahead/lookbehind	❌	✅	✅
Non-greedy `.*?`	❌	✅	✅
Named groups `(?P<n>...)`	❌	✅	✅
Backreferences `\1`	❌ (basic grep only)	✅	✅
Catastrophic backtracking	❌ (linear time)	✅ (vulnerable)	✅ (vulnerable)

Gotcha: \d does NOT work in grep -E, sed, or awk. Use [0-9] for portable regex. This is the #1 regex portability gotcha — people assume \d is universal because it works in Python and JavaScript.

Name Origin: "grep" stands for g/re/p — the ed editor command that means "globally search for a regular expression and print matching lines." It was extracted into a standalone tool by Ken Thompson on a PDP-11 in 1973.

Regex in Production: Safety Rules¶

1. Always set a timeout¶

import re
import signal

def timeout_handler(signum, frame):
    raise TimeoutError("Regex took too long")

signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(1)  # 1 second timeout

try:
    re.match(pattern, untrusted_input)
except TimeoutError:
    log.warning("Regex timeout on user input")
finally:
    signal.alarm(0)

2. Avoid nested quantifiers on user input¶

If the input comes from users (form fields, API parameters, log entries), never apply a regex with nested quantifiers. Either simplify the pattern or use Google's RE2 (guaranteed linear time):

# Use RE2 for user-facing validation (no backtracking)
import re2  # pip install google-re2
re2.match(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', user_email)

3. Prefer specific patterns over greedy `.*`¶

# BAD — .* matches everything, then backtracks
r'".*"'
# On input: "a" "b" "c" → .* matches from first " to last "

# GOOD — [^"]* can't match past a quote
r'"[^"]*"'
# Matches exactly one quoted string, no backtracking

Flashcard Check¶

Q1: DFA vs NFA — what's the practical difference?

DFA: linear time, no backtracking, limited features (no backreferences, lookahead). NFA: backtracking, can be exponential, supports advanced features. grep uses DFA; Python/JavaScript use NFA.

Q2: Pattern (a+)+$ on input aaaaaX — why is it slow?

Nested quantifiers with overlapping matches. The NFA tries every way to partition the as between the inner a+ and outer ()+, then fails on X. Exponential combinations.

Q3: \d in grep -E — does it work?

No. \d is PCRE syntax. grep -E uses POSIX ERE, which doesn't support \d. Use [0-9] for portability across grep, sed, and awk.

Q4: How do you make regex safe for user input?

Avoid nested quantifiers. Use bounded quantifiers ({0,100} instead of *). Set a timeout. Or use RE2 (Google's regex engine, guaranteed linear time).

Cheat Sheet¶

Safe Regex Patterns¶

Unsafe	Safe	Why
`.*`	`[^X]*` (where X is delimiter)	Bounded by delimiter, no backtracking
`(a+)+`	`a+`	Remove nested quantifiers
`(a\\|aa)+`	`a+`	Non-overlapping alternatives
`(.*)\\1`	Use substring matching instead	Backreferences can cause exponential time

Regex Dialect Quick Reference¶

What you want	ERE (grep -E, sed, awk)	PCRE (grep -P, Python, JS)
Digit	`[0-9]`	`\d`
Word character	`[a-zA-Z0-9_]`	`\w`
Whitespace	`[[:space:]]`	`\s`
Word boundary	N/A	`\b`
Non-greedy	N/A	`.*?`
Named group	N/A	`(?P<name>...)`

Takeaways¶

Regex engines have two types. DFA (linear, safe) and NFA (backtracking, potentially exponential). Know which your language uses.
Nested quantifiers on untrusted input = DoS vector. (a+)+, (a|aa)+, (.*a){n} can all be exploited. Simplify or use RE2.
\d is not universal. It works in Python, JavaScript, and grep -P. It does NOT work in grep -E, sed, or awk. Use [0-9] for portability.
Prefer [^X]* over .*. Matching "everything except the delimiter" is specific, fast, and doesn't backtrack. .* is greedy and causes most backtracking problems.
grep = g/re/p. From the ed editor. 1973. Still the most-used regex tool on Earth.

strace: Reading the Matrix — when regex CPU usage shows up as unexpected syscalls
The Mysterious Latency Spike — when a slow regex causes a latency spike