What Happens When You Type a Regex
- lesson
- regex-internals
- nfa/dfa
- backtracking
- catastrophic-regex
- grep-vs-pcre
- performance ---# What Happens When You Type a Regex
Topics: regex internals, NFA/DFA, backtracking, catastrophic regex, grep vs PCRE, performance Level: L1–L2 (Foundations → Operations) Time: 45–60 minutes Prerequisites: Basic regex usage (matching patterns)
The Mission¶
You write a regex to validate email addresses. It works in tests. In production, a single malformed input causes your regex to run for 30 seconds, consuming 100% CPU. Your service is effectively DoS'd by a string.
This is catastrophic backtracking — a regex that takes exponential time on certain inputs. It's a real vulnerability (ReDoS) and it's in more production code than anyone wants to admit. Understanding how regex engines actually work prevents this.
Two Types of Regex Engine¶
Every regex engine is one of two types:
DFA (Deterministic Finite Automaton)¶
Used by: grep, awk, RE2 (Google)
The engine builds a state machine from the pattern and processes the input string character by character, making exactly one state transition per character. Time is always linear in the length of the input — O(n).
NFA (Nondeterministic Finite Automaton) with backtracking¶
Used by: Python re, JavaScript, Java, Ruby, Perl, grep -P, PCRE
The engine tries to match the pattern by exploring possibilities. When a choice point exists
(like .* followed by more pattern), it tries one path. If that path fails, it backtracks
and tries the other. On certain patterns and inputs, this backtracking can be exponential.
Pattern: (a+)+$
Input: aaaaaaaaaaaaaaaaX
NFA tries: aaaaaaaaaaaaaaaa → X doesn't match $
Backtrack: aaaaaaaaaaaaaaa + a → X doesn't match
Backtrack: aaaaaaaaaaaaaa + aa → X doesn't match
Backtrack: aaaaaaaaaaaaaa + a + a → X doesn't match
... 2^16 combinations to exhaust!
Mental Model: DFA is like driving on a one-way road system — you can only go forward, one turn per intersection, and you always reach the destination (or a dead end) in predictable time. NFA with backtracking is like exploring a maze — when you hit a wall, you go back and try another path. Most mazes are simple. But some mazes have exponentially many dead ends.
Catastrophic Backtracking: The ReDoS Vulnerability¶
ReDoS (Regular Expression Denial of Service) exploits patterns where NFA backtracking is exponential. The attacker sends a crafted input that makes your regex engine consume all CPU.
Dangerous patterns (all contain nested quantifiers):
(a+)+ # Nested quantifiers
(a|aa)+ # Overlapping alternatives
(.*a){10} # Quantified group with greedy match
([a-zA-Z]+)* # Quantified group matching the same characters
Safe versions:
a+ # No nesting
(a|b)+ # Non-overlapping alternatives
.{0,100}a # Bounded quantifier instead of .*
[a-zA-Z]+ # No unnecessary grouping with quantifier
Testing for catastrophic backtracking¶
import re
import time
pattern = re.compile(r'(a+)+$')
for length in [10, 15, 20, 25]:
test_input = 'a' * length + 'X'
start = time.time()
pattern.match(test_input)
elapsed = time.time() - start
print(f"Length {length}: {elapsed:.3f}s")
# → Length 10: 0.001s
# → Length 15: 0.035s
# → Length 20: 1.100s ← exponential growth!
# → Length 25: 35.000s ← production DoS
grep vs PCRE: The Dialect Divide¶
| Feature | grep -E (ERE/POSIX) |
grep -P (PCRE) |
Python re |
|---|---|---|---|
| Engine | DFA (usually) | NFA (backtracking) | NFA (backtracking) |
\d (digit) |
❌ Use [0-9] |
✅ | ✅ |
\b (word boundary) |
❌ | ✅ | ✅ |
| Lookahead/lookbehind | ❌ | ✅ | ✅ |
Non-greedy .*? |
❌ | ✅ | ✅ |
Named groups (?P<n>...) |
❌ | ✅ | ✅ |
Backreferences \1 |
❌ (basic grep only) | ✅ | ✅ |
| Catastrophic backtracking | ❌ (linear time) | ✅ (vulnerable) | ✅ (vulnerable) |
Gotcha:
\ddoes NOT work ingrep -E,sed, orawk. Use[0-9]for portable regex. This is the #1 regex portability gotcha — people assume\dis universal because it works in Python and JavaScript.Name Origin: "grep" stands for
g/re/p— theededitor command that means "globally search for a regular expression and print matching lines." It was extracted into a standalone tool by Ken Thompson on a PDP-11 in 1973.
Regex in Production: Safety Rules¶
1. Always set a timeout¶
import re
import signal
def timeout_handler(signum, frame):
raise TimeoutError("Regex took too long")
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(1) # 1 second timeout
try:
re.match(pattern, untrusted_input)
except TimeoutError:
log.warning("Regex timeout on user input")
finally:
signal.alarm(0)
2. Avoid nested quantifiers on user input¶
If the input comes from users (form fields, API parameters, log entries), never apply a regex with nested quantifiers. Either simplify the pattern or use Google's RE2 (guaranteed linear time):
# Use RE2 for user-facing validation (no backtracking)
import re2 # pip install google-re2
re2.match(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', user_email)
3. Prefer specific patterns over greedy .*¶
# BAD — .* matches everything, then backtracks
r'".*"'
# On input: "a" "b" "c" → .* matches from first " to last "
# GOOD — [^"]* can't match past a quote
r'"[^"]*"'
# Matches exactly one quoted string, no backtracking
Flashcard Check¶
Q1: DFA vs NFA — what's the practical difference?
DFA: linear time, no backtracking, limited features (no backreferences, lookahead). NFA: backtracking, can be exponential, supports advanced features. grep uses DFA; Python/JavaScript use NFA.
Q2: Pattern (a+)+$ on input aaaaaX — why is it slow?
Nested quantifiers with overlapping matches. The NFA tries every way to partition the
as between the innera+and outer()+, then fails onX. Exponential combinations.
Q3: \d in grep -E — does it work?
No.
\dis PCRE syntax.grep -Euses POSIX ERE, which doesn't support\d. Use[0-9]for portability across grep, sed, and awk.
Q4: How do you make regex safe for user input?
Avoid nested quantifiers. Use bounded quantifiers (
{0,100}instead of*). Set a timeout. Or use RE2 (Google's regex engine, guaranteed linear time).
Cheat Sheet¶
Safe Regex Patterns¶
| Unsafe | Safe | Why |
|---|---|---|
.* |
[^X]* (where X is delimiter) |
Bounded by delimiter, no backtracking |
(a+)+ |
a+ |
Remove nested quantifiers |
(a\|aa)+ |
a+ |
Non-overlapping alternatives |
(.*)\\1 |
Use substring matching instead | Backreferences can cause exponential time |
Regex Dialect Quick Reference¶
| What you want | ERE (grep -E, sed, awk) | PCRE (grep -P, Python, JS) |
|---|---|---|
| Digit | [0-9] |
\d |
| Word character | [a-zA-Z0-9_] |
\w |
| Whitespace | [[:space:]] |
\s |
| Word boundary | N/A | \b |
| Non-greedy | N/A | .*? |
| Named group | N/A | (?P<name>...) |
Takeaways¶
-
Regex engines have two types. DFA (linear, safe) and NFA (backtracking, potentially exponential). Know which your language uses.
-
Nested quantifiers on untrusted input = DoS vector.
(a+)+,(a|aa)+,(.*a){n}can all be exploited. Simplify or use RE2. -
\dis not universal. It works in Python, JavaScript, andgrep -P. It does NOT work ingrep -E,sed, orawk. Use[0-9]for portability. -
Prefer
[^X]*over.*. Matching "everything except the delimiter" is specific, fast, and doesn't backtrack..*is greedy and causes most backtracking problems. -
grep = g/re/p. From the ed editor. 1973. Still the most-used regex tool on Earth.
Related Lessons¶
- strace: Reading the Matrix — when regex CPU usage shows up as unexpected syscalls
- The Mysterious Latency Spike — when a slow regex causes a latency spike