Regex & Text Wrangling Footguns
-
Using greedy quantifiers when you want minimal matching. You write grep -oE '<.*>' to extract HTML tags and get everything from the first < to the last > on the line instead of individual tags. Greedy .* consumes as much as it can. Fix: Use negated character classes (<[^>]+>) for portability, or PCRE non-greedy quantifiers (grep -oP '<.*?>', GNU grep only). Default to negated classes: they work everywhere.

War story: On July 2, 2019, a regex with nested greedy quantifiers (.*.*) in Cloudflare's WAF caused catastrophic backtracking, taking down their entire CDN for 27 minutes and affecting billions of requests. The pattern was deployed globally with no canary rollout. Greedy quantifiers are not just a correctness issue; they can be a denial-of-service vector.
-
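To make the greedy-versus-negated-class difference concrete, here is a minimal sketch. The sample line and variable names are made up for illustration, and it assumes a grep that supports -o (GNU and BSD grep both do).

```shell
# Hypothetical sample line; the tags and text are made up for illustration.
line='<b>bold</b> and <i>italic</i>'

# Greedy: .* runs from the first '<' to the last '>' on the line.
greedy=$(printf '%s\n' "$line" | grep -oE '<.*>')

# Negated class: each match stops at the first '>' it reaches.
tags=$(printf '%s\n' "$line" | grep -oE '<[^>]+>')

printf 'greedy match:\n%s\n' "$greedy"
printf 'negated-class matches:\n%s\n' "$tags"
```

The greedy pattern returns the whole line as a single match, while the negated class returns the four tags on separate lines.
-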
Running sed -i without a backup on production configs. You fat-finger a regex in sed -i 's/...' nginx.conf and the file is now corrupted. There is no undo. Nginx will not start. Fix: Always use sed -i.bak to create a backup. Better yet, preview with sed 's/.../' file | diff file - before applying. On macOS, sed -i '' is the syntax (empty extension), which creates no backup; avoid it in scripts.

One-liner: Preview before destroying: sed 's/old/new/g' file | diff file -. If the diff looks right, then apply with sed -i.bak 's/old/new/g' file.
-
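The preview-then-apply habit for sed -i can be rehearsed safely on a scratch file. This is a sketch under stated assumptions: the file name, its contents, and the hostnames are hypothetical, and nothing here touches a real nginx.conf.

```shell
# Hypothetical scratch config in a temporary directory.
workdir=$(mktemp -d)
conf="$workdir/app.conf"
printf 'listen 80;\nserver_name old.example.com;\n' > "$conf"

# 1. Preview: sed writes to stdout, diff shows what would change.
#    diff exits 1 when the inputs differ, so do not treat that as a failure.
sed 's/old\.example\.com/new.example.com/' "$conf" | diff "$conf" - || true

# 2. Apply in place, keeping the original as app.conf.bak.
sed -i.bak 's/old\.example\.com/new.example.com/' "$conf"

# 3. Confirm that both the edited file and the backup exist.
ls "$workdir"
```

The attached-suffix form (-i.bak, no space) is used because both GNU sed and the macOS sed accept it. Remove the scratch directory with rm -r "$workdir" when done.
-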
Forgetting the difference between BRE and ERE. You write grep 'error|fatal' logfile expecting OR logic, but grep uses Basic Regular Expressions by default, so it matches the literal string error|fatal. You get zero hits and conclude there are no errors. Fix: Use grep -E (extended regex) for alternation, quantifiers (+, ?), and grouping without backslashes. The older egrep spelling does the same thing but is deprecated in recent GNU grep. Make grep -E your default habit.
-
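The BRE-versus-ERE trap is easy to reproduce. A minimal sketch, with made-up log lines and hypothetical variable names:

```shell
# Hypothetical log lines, made up for illustration.
log=$(printf 'error: disk full\nfatal: out of memory\ninfo: all good\n')

# BRE: '|' is a literal character, so nothing matches.
# grep -c still prints 0 but exits non-zero, hence the '|| true'.
bre_hits=$(printf '%s\n' "$log" | grep -c 'error|fatal' || true)

# ERE: '|' is alternation, so both problem lines match.
ere_hits=$(printf '%s\n' "$log" | grep -cE 'error|fatal')

printf 'BRE hits: %s\nERE hits: %s\n' "$bre_hits" "$ere_hits"
```

The same pattern finds zero lines as a BRE and two lines as an ERE, which is exactly how real errors get missed.
-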
Locale-dependent character ranges. Your script uses grep '[A-Z]' to find uppercase letters. On your laptop it works fine. On a production server with a different locale, [A-Z] can match lowercase letters too because of collation ordering. Fix: Use POSIX character classes ([[:upper:]], [[:digit:]]) or explicitly set LC_ALL=C before running text-processing commands in scripts.

Gotcha: In the en_US.UTF-8 locale, some tool and libc versions collate letters in dictionary order (AaBbCc...YyZz), so [A-Z] also matches a through y (lowercase!) and only z is left out. This is not a bug: POSIX leaves range expressions unspecified outside the C/POSIX locale. LC_ALL=C forces byte-order sorting where [A-Z] means what you think it means.
-
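Both fixes for locale-dependent ranges fit in a few lines. A sketch with a made-up word list; the variable names are hypothetical. It shows the two safe forms only, since the unsafe behavior depends on which locales and tool versions are installed.

```shell
# Hypothetical word list, made up for illustration.
words=$(printf 'alpha\nBravo\ncharlie\nDELTA\n')

# Option 1: pin the locale so [A-Z] means the ASCII range A through Z.
range_hits=$(printf '%s\n' "$words" | LC_ALL=C grep -c '[A-Z]')

# Option 2: use the POSIX class, which says "uppercase" explicitly
# and works under whatever locale is active.
class_hits=$(printf '%s\n' "$words" | grep -c '[[:upper:]]')

printf 'range under LC_ALL=C: %s\nPOSIX class: %s\n' "$range_hits" "$class_hits"
```

Setting LC_ALL=C on the single command, as in option 1, keeps the override from leaking into the rest of the script.
-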
Using awk or sed to parse CSV with embedded commas or quotes. Your CSV has fields like "Smith, John",42,New York. A naive awk -F, splits this into four fields instead of three because it does not understand quoting. Fix: Use a proper CSV tool: csvkit (csvcut, csvgrep), miller (mlr), or Python's csv module. For quick-and-dirty work on simple CSVs without embedded commas, awk is fine, but know its limits.
-
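The field-count mismatch on the CSV row above can be checked directly. A sketch, assuming python3 may or may not be installed (the quote-aware half is skipped if it is missing); the variable names are hypothetical.

```shell
# The sample row from the CSV item above.
row='"Smith, John",42,New York'

# Naive split: awk treats every comma as a delimiter, quoted or not.
naive_fields=$(printf '%s\n' "$row" | awk -F, '{print NF}')
printf 'awk -F, field count: %s\n' "$naive_fields"

# Quote-aware parse via Python's csv module, if python3 is available.
if command -v python3 >/dev/null 2>&1; then
  csv_fields=$(printf '%s\n' "$row" |
    python3 -c 'import csv, sys; print(len(next(csv.reader(sys.stdin))))')
  printf 'csv module field count: %s\n' "$csv_fields"
fi
```

Comparing field counts like this is a cheap sanity check before trusting awk on a file someone else exported.
-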
Piping to uniq when the input is not sorted. uniq only removes adjacent duplicates. If your input is the four lines a, b, a, b, uniq outputs a, b, a, b (nothing changed). You get wrong counts with uniq -c. Fix: Always sort before uniq. The pattern is sort | uniq -c | sort -rn. Alternatively, use awk '!seen[$0]++' to deduplicate without sorting (preserves original order).
-
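The adjacent-duplicates rule for uniq is easiest to believe once you see the line counts. A minimal sketch using the a, b, a, b input from the item above; the variable names are hypothetical.

```shell
# Four lines with duplicates that are not adjacent.
input=$(printf 'a\nb\na\nb\n')

# uniq alone: no adjacent duplicates, so all four lines survive.
uniq_only=$(printf '%s\n' "$input" | uniq | wc -l)

# sort first: duplicates become adjacent and collapse to two lines.
sorted_uniq=$(printf '%s\n' "$input" | sort | uniq | wc -l)

# awk: keeps the first occurrence of each line, in original order.
dedup=$(printf '%s\n' "$input" | awk '!seen[$0]++')

printf 'uniq alone: %s lines\nsorted then uniq: %s lines\n' "$uniq_only" "$sorted_uniq"
printf 'awk dedup:\n%s\n' "$dedup"
```

The awk form holds every distinct line in memory, so for very large inputs sort | uniq may be the safer choice.
-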
Forgetting that sed and awk regex matching is line-oriented. You try to match a multi-line pattern with sed 's/start.*end/replacement/' and it never matches because sed processes one line at a time, so .* never sees the next line. Fix: For multi-line sed, use the N command to pull in the next line (sed '/start/{N;s/start\n.*end/replacement/}'). For complex multi-line work, reach for awk with RS="" (paragraph mode) or perl -0 (use -0777 to slurp the whole file).
-
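Here is the N command from the multi-line sed item run against a small made-up input. This is a sketch: the sample text and variable names are hypothetical, and a semicolon is added before the closing brace so that non-GNU sed implementations also accept the one-liner.

```shell
# Hypothetical input where "start" and "end" sit on consecutive lines.
text=$(printf 'before\nstart\nmiddle end\nafter\n')

# Line by line: "start" and "end" never share a pattern space, so no match.
single=$(printf '%s\n' "$text" | sed 's/start.*end/replacement/')

# With N: on a line matching /start/, append the next line, then
# substitute across the embedded newline.
joined=$(printf '%s\n' "$text" | sed '/start/{N;s/start\n.*end/replacement/;}')

printf 'line-oriented:\n%s\n\nwith N:\n%s\n' "$single" "$joined"
```

This only spans two lines, and only when start is the last thing on its line. Anything longer is a sign to switch to awk or perl.
-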
Escaping hell in nested quoting. You write a regex inside a sed command inside a bash script and lose track of which characters need escaping at which level. Inside double quotes, bash processes backslashes and expands $ before sed sees the expression. Fix: Use single quotes around sed/awk expressions; single quotes prevent all shell interpretation. If you need a shell variable inside the expression, close the single quote, add the variable in double quotes, and reopen: sed 's/old/'"$var"'/g'. Or use awk's -v flag: awk -v val="$var" '{gsub(/old/, val); print}'.
-
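Both ways of getting a shell variable into an expression, side by side. A sketch with a made-up value; the variable names are hypothetical.

```shell
# Hypothetical replacement text containing a space.
var='new value'

# Close the single quote, splice in "$var" double-quoted, reopen the quote.
out_sed=$(printf 'old thing\n' | sed 's/old/'"$var"'/g')

# Pass the value in with -v and print the modified record explicitly.
out_awk=$(printf 'old thing\n' | awk -v val="$var" '{ gsub(/old/, val); print }')

printf 'sed: %s\nawk: %s\n' "$out_sed" "$out_awk"
```

Neither form is safe for arbitrary input: a / or & in the value still has meaning to sed, and awk interprets backslash escapes in -v values and & in the gsub replacement.
-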
Writing a complex regex instead of chaining simple commands. You write one monstrous regex to parse, filter, and extract in a single grep. It takes 20 minutes to debug and is unreadable by anyone, including future you. Fix: Pipe simple commands together. grep 'ERROR' | awk '{print $5}' | sort -u is more readable and debuggable than a single regex that does all three. Unix philosophy: small tools, connected by pipes.
-
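The payoff of the chained-commands approach is that each stage can be inspected on its own. A sketch, assuming a hypothetical log format (date, time, level, component, error code, detail) invented for this example:

```shell
# Hypothetical log lines; the format and error codes are made up.
log=$(printf '%s\n' \
  '2024-05-01 10:00:01 ERROR api db_timeout retry=1' \
  '2024-05-01 10:00:02 INFO api request_ok' \
  '2024-05-01 10:00:03 ERROR api auth_failed user=42' \
  '2024-05-01 10:00:04 ERROR api db_timeout retry=2')

# Stage 1: keep only ERROR lines. Check this output before adding more stages.
errors=$(printf '%s\n' "$log" | grep 'ERROR')

# Stages 2 and 3: pull out field 5, then deduplicate.
codes=$(printf '%s\n' "$errors" | awk '{print $5}' | sort -u)

printf 'distinct error codes:\n%s\n' "$codes"
```

When the result looks wrong, print the intermediate variable and the broken stage is obvious.
-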
Using grep -r without limiting the search scope. You run grep -r 'password' / and grep crawls through binary files, /proc, /sys, and mounted NFS volumes. It takes forever and returns garbage matches from binaries. Fix: Limit the scope: grep -r --include='*.conf' 'password' /etc/. Exclude binary files with grep -rI. Exclude specific directories with --exclude-dir=.git. Or use ripgrep (rg), which respects .gitignore and skips binary files by default.

Remember: grep -r crosses filesystem boundaries, and grep -R additionally follows every symlink it finds (GNU grep -r follows only symlinks named on the command line). On a Linux host, /proc and /sys are virtual filesystems with thousands of pseudo-files. NFS mounts add network latency per file. Always scope your search to a specific directory tree.
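-
The scoping flags for grep -r can be tried on a scratch tree instead of a real filesystem. A sketch under stated assumptions: the directory layout and file contents are hypothetical, and the --include, --exclude-dir, and -I options assume GNU grep or a compatible implementation.

```shell
# Hypothetical scratch tree standing in for a real filesystem.
tree=$(mktemp -d)
mkdir -p "$tree/etc" "$tree/.git"
printf 'password = hunter2\n' > "$tree/etc/app.conf"
printf 'remember to rotate the password\n' > "$tree/etc/notes.txt"
printf 'password in repository metadata\n' > "$tree/.git/packed-refs"

# Unscoped: counts every file under the tree that mentions the word.
all_hits=$(grep -rl 'password' "$tree" | wc -l)

# Scoped: text files only (-I), *.conf only, and never descend into .git.
conf_hits=$(grep -rIl --include='*.conf' --exclude-dir=.git 'password' "$tree")

printf 'unscoped files: %s\nscoped files:\n%s\n' "$all_hits" "$conf_hits"
```

The unscoped search reports all three files, the scoped one only etc/app.conf. Remove the scratch tree with rm -r "$tree" when done.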