
Regex & Text Wrangling Footguns

  1. Using greedy quantifiers when you want minimal matching. You write grep -oE '<.*>' to extract HTML tags and get the entire line from first < to last > instead of individual tags. Greedy .* consumes everything. Fix: Use negated character classes (<[^>]+>) for portability, or PCRE non-greedy quantifiers (grep -oP '<.*?>'). Default to negated classes — they work everywhere.

    War story: On July 2, 2019, a regex with nested greedy quantifiers (.*.*) in Cloudflare's WAF caused catastrophic backtracking, taking down their entire CDN for 27 minutes and affecting billions of requests. The pattern was deployed globally with no canary rollout. Greedy quantifiers are not just a correctness issue — they can be a denial-of-service vector.
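    Try it: a quick terminal comparison of the three approaches (`-o` needs GNU or BSD grep; `-P` needs GNU grep built with PCRE):

```shell
# Greedy: a single match spanning from the first < to the last > on the line
echo '<b>bold</b> and <i>italic</i>' | grep -oE '<.*>'

# Negated class: four separate tags, no PCRE required
echo '<b>bold</b> and <i>italic</i>' | grep -oE '<[^>]+>'

# Non-greedy quantifier: same four tags, GNU grep -P only
echo '<b>bold</b> and <i>italic</i>' | grep -oP '<.*?>'
```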

  2. Running sed -i without a backup on production configs. You fat-finger a regex in sed -i 's/...' nginx.conf and the file is now corrupted. There is no undo. Nginx will not start. Fix: Always use sed -i.bak to create a backup. Better yet, preview with sed 's/.../' file | diff file - before applying. On macOS, sed -i '' is the syntax (empty extension), which creates no backup — avoid it in scripts.

    One-liner: Preview before destroying: sed 's/old/new/g' file | diff file -. If the diff looks right, then apply with sed -i.bak 's/old/new/g' file.
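    Sketch: the full preview-then-apply workflow on a scratch file, so nothing real is at risk (the file content and pattern are illustrative):

```shell
# Create a throwaway file to practice on
tmp=$(mktemp)
printf 'server_name old.example.com;\n' > "$tmp"

# 1. Preview: diff exits nonzero when there are changes, hence || true
sed 's/old/new/g' "$tmp" | diff "$tmp" - || true

# 2. Apply with a backup; -i.bak (no space before the suffix) works on
#    both GNU sed and BSD/macOS sed
sed -i.bak 's/old/new/g' "$tmp"

# The pre-edit content survives in "$tmp.bak"
cat "$tmp" "$tmp.bak"
rm -f "$tmp" "$tmp.bak"
```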

  3. Forgetting the difference between BRE and ERE. You write grep 'error|fatal' logfile expecting OR logic, but grep uses Basic Regular Expressions by default, so it matches the literal string error|fatal. You get zero hits and conclude there are no errors. Fix: Use grep -E (extended regex) or egrep for alternation, quantifiers (+, ?), and grouping without backslashes. Make grep -E your default habit.
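    Try it: the same alternation under BRE and ERE (the backslashed `\|` form is a GNU extension, not POSIX BRE):

```shell
log='error: disk full
fatal: oom
all good'

# BRE: | is a literal character, so this matches nothing and prints 0
printf '%s\n' "$log" | grep -c 'error|fatal' || true

# ERE: | is alternation; prints 2
printf '%s\n' "$log" | grep -cE 'error|fatal'

# BRE alternation needs a backslash (GNU grep extension); also prints 2
printf '%s\n' "$log" | grep -c 'error\|fatal'
```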

  4. Locale-dependent character ranges. Your script uses grep '[A-Z]' to find uppercase letters. On your laptop it works fine. On a production server with a different locale, [A-Z] matches lowercase letters too because of collation ordering. Fix: Use POSIX character classes ([[:upper:]], [[:digit:]]) or explicitly set LC_ALL=C before running text-processing commands in scripts.

    Gotcha: With glibc collation in a locale like en_US.UTF-8, letters sort aAbBcC...zZ, so the range [A-Z] runs from A to Z in that order and sweeps up every lowercase letter except a (which sorts before A). This is not a bug: POSIX leaves range expressions undefined outside the C locale, and many implementations use collation order. LC_ALL=C forces byte-order sorting, where [A-Z] means exactly the 26 uppercase ASCII letters.
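    Try it: both fixes side by side (whether the locale-sensitive failure itself reproduces depends on your libc and tool versions, but these two forms are safe everywhere):

```shell
# Byte-order semantics: [A-Z] is exactly the 26 ASCII uppercase letters
printf 'abc\nABC\n' | LC_ALL=C grep '[A-Z]'

# POSIX character class: correct in every locale, no environment override needed
printf 'abc\nABC\n' | grep '[[:upper:]]'
```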

  5. Using awk or sed to parse CSV with embedded commas or quotes. Your CSV has fields like "Smith, John",42,New York. A naive awk -F, splits this into four fields instead of three because it does not understand quoting. Fix: Use a proper CSV tool: csvkit (csvcut, csvgrep), miller (mlr), or Python's csv module. For quick-and-dirty work on simple CSVs without embedded commas, awk is fine — but know its limits.
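    Sketch: the failure and one fix, shown with Python's csv module rather than csvkit/miller since it ships with any python3 install (the sample row is the one from above):

```shell
line='"Smith, John",42,New York'

# Naive comma splitting: prints 4, because the quoted comma is split too
printf '%s\n' "$line" | awk -F, '{print NF}'

# A real CSV parser: prints 3
printf '%s\n' "$line" | python3 -c 'import csv, sys; print(len(next(csv.reader(sys.stdin))))'
```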

  6. Piping to sort | uniq when the input is not sorted. uniq only removes adjacent duplicates. If your input is a b a b, uniq outputs a b a b (nothing changed). You get wrong counts with uniq -c. Fix: Always sort before uniq. The pattern is sort | uniq -c | sort -rn. Alternatively, use awk '!seen[$0]++' to deduplicate without sorting (preserves original order).
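    Try it: all three behaviors on a four-line input:

```shell
input='a
b
a
b'

# uniq alone: no duplicates are adjacent, so all four lines survive
printf '%s\n' "$input" | uniq

# The classic frequency count: sort first, then uniq -c, then rank
printf '%s\n' "$input" | sort | uniq -c | sort -rn

# Order-preserving dedup without sorting: prints a, then b
printf '%s\n' "$input" | awk '!seen[$0]++'
```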

  7. Forgetting that sed and awk regex is line-oriented. You try to match a multi-line pattern with sed 's/start.*end/replacement/' and it never matches because .* does not cross line boundaries in sed. Fix: For multi-line sed, use the N command to pull in the next line (sed '/start/{N;s/start\n.*end/replacement/}'). For complex multi-line work, reach for awk with RS="" or perl -0.

  8. Escaping hell in nested quoting. You write a regex inside a bash script inside a sed command and lose track of which characters need escaping at which level. Bash eats backslashes before sed sees them. Fix: Use single quotes around sed/awk expressions — single quotes prevent all shell interpretation. If you need a shell variable inside the expression, close the single quote, add the variable, and reopen: sed 's/old/'"$var"'/g'. Or use awk's -v flag: awk -v val="$var" '{gsub(/old/, val)}'.
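    Try it: both splicing styles (note that a value containing /, &, or backslashes would still break the sed version; awk -v sidesteps most of that):

```shell
var='REPLACED'

# Close the single quote, splice the variable inside double quotes, reopen
echo 'old text' | sed 's/old/'"$var"'/g'

# awk -v: the shell never rewrites the program text at all
echo 'old text' | awk -v val="$var" '{gsub(/old/, val); print}'
```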

  9. Writing a complex regex instead of chaining simple commands. You write one monstrous regex to parse, filter, and extract in a single grep. It takes 20 minutes to debug and is unreadable by anyone, including future you. Fix: Pipe simple commands together. grep 'ERROR' | awk '{print $5}' | sort -u is more readable and debuggable than a single regex that does all three. Unix philosophy: small tools, connected by pipes.
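    Sketch: the pipeline on a made-up log (the layout is hypothetical; field 4 happens to be the failing component). The win is debuggability: cut the pipeline short after any stage to inspect its output.

```shell
log='2024-01-01 12:00:01 ERROR db timeout
2024-01-01 12:00:02 INFO heartbeat
2024-01-01 12:00:03 ERROR db timeout'

# Filter, then extract field 4, then deduplicate
printf '%s\n' "$log" | grep 'ERROR' | awk '{print $4}' | sort -u
```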

  10. Using grep -r without limiting the search scope. You run grep -r 'password' / and grep crawls through binary files, /proc, /sys, and mounted NFS volumes. It takes forever and returns garbage matches from binaries. Fix: Limit the scope: grep -r --include='*.conf' 'password' /etc/. Exclude binary files with grep -rI. Exclude specific directories with --exclude-dir=.git. Or use ripgrep (rg), which respects .gitignore and skips binary files by default.

    Remember: With GNU grep, -R follows symlinks everywhere, while -r follows them only when they are named on the command line; either way, recursion happily crosses filesystem boundaries. On a Linux host, /proc and /sys are virtual filesystems with thousands of pseudo-files, and NFS mounts add network latency per file. Always scope your search to a specific directory tree.
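    Sketch: a self-contained demo of the scoping flags on a throwaway tree (paths are illustrative; --include and --exclude-dir are GNU grep options):

```shell
dir=$(mktemp -d)
mkdir -p "$dir/etc" "$dir/.git"
printf 'password=hunter2\n' > "$dir/etc/app.conf"
printf 'password=scratch\n' > "$dir/etc/notes.txt"
printf 'password=scratch\n' > "$dir/.git/blob"

# Only *.conf files, skipping VCS metadata: lists app.conf alone
grep -r --include='*.conf' --exclude-dir=.git -l 'password' "$dir"

rm -rf "$dir"
```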