Regex

26 cards — 🟢 5 easy | 🟡 9 medium | 🔴 6 hard

🟢 Easy (5)

1. What grep flag enables Extended Regular Expressions (ERE) to avoid backslash escaping?

grep -E (egrep is an older, deprecated alias). ERE allows +, ?, {n,m}, and () without backslash escaping, unlike BRE where you must write \+, \?, \{n,m\}, and \(\).

Remember: "Regex = pattern matching language." The dot . matches any character except newline (unless DOTALL flag).

Example: grep -E "^error" log.txt matches lines starting with "error."

2. How do you replace all occurrences of a pattern on every line in a file using sed?

sed 's/old/new/g' file. The g flag means global (all occurrences per line). Without g, only the first occurrence on each line is replaced.

Remember: "Star = zero or more, Plus = one or more, Question = zero or one."

Example: colou?r matches both "color" and "colour". The ? makes the u optional.
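
Sketch: the first-versus-all distinction behind the sed g flag can be mirrored with Python's re.sub, where count=1 plays the role of omitting g. Python is used here only as a convenient stand-in, and the sample line is made up.

```python
import re

line = "old habits, old code, old bugs"  # made-up sample line

# Like sed 's/old/new/' (no g flag): only the first occurrence is replaced.
first_only = re.sub(r"old", "new", line, count=1)

# Like sed 's/old/new/g': every occurrence on the line is replaced.
all_matches = re.sub(r"old", "new", line)

print(first_only)
print(all_matches)
```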

3. How do you print the second column of a space-delimited file using awk?

awk '{print $2}' file. Use -F to change the delimiter, e.g., awk -F: '{print $1, $3}' /etc/passwd for colon-delimited files.

Remember: "Brackets = character class = pick one." [aeiou] matches any vowel. [^aeiou] matches any non-vowel.

Example: [0-9]{3}-[0-9]{4} matches phone-number-like patterns like 555-1234.

4. What are the most common regex character classes and what do they match?

. matches any character (except newline). \d matches a digit [0-9]. \w matches a word character [a-zA-Z0-9_]. \s matches whitespace (space, tab, newline). \b matches a word boundary. Capitalize to negate: \D = non-digit, \W = non-word, \S = non-whitespace.
Note: \d, \w, \s are PCRE-only — use [[:digit:]], [[:alnum:]], [[:space:]] in POSIX tools like sed and awk.
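
Sketch: the shorthand classes can be tried out in Python's re module, which supports the same \d, \w, \s, and \b escapes. The sample text is made up; note that for str patterns Python's \d and \w also match non-ASCII digits and letters.

```python
import re

text = "admin_7 signed in at 09:15\tfrom lab"  # made-up sample text

digits = re.findall(r"\d+", text)          # runs of digits
words = re.findall(r"\w+", text)           # runs of word characters (letters, digits, _)
fields = re.split(r"\s+", text)            # split on runs of whitespace (space, tab)
whole_word = re.findall(r"\bin\b", text)   # "in" only as a standalone word

print(digits)
print(words)
print(fields)
print(whole_word)
```

The \bin\b pattern skips the "in" inside "admin_7" because the characters on both sides of it are word characters, so there is no boundary there.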

5. What are common regex patterns for matching IPs, emails, and URLs?

IP address: [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+ (simple, allows invalid octets) or more strict with range checks.
Email (basic): [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
URL: https?://[^\s]+ (simple) or https?://[a-zA-Z0-9.-]+(/[^\s]*)? for more structure.
These are pragmatic patterns for log extraction, not RFC-compliant validators.

Remember: "\b = word boundary." It matches the position between a word character and a non-word character. \bcat\b matches "cat" but not "concatenate."
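
Sketch: one possible way to add the "range checks" mentioned for IP addresses, written in Python. The OCTET pattern and the is_strict_ip helper are illustrative assumptions, not part of the card; the pattern restricts each octet to 0-255 and rejects leading zeros.

```python
import re

# Loose pattern from the card: any four dot-separated digit runs.
LOOSE_IP = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")

# Hypothetical stricter octet: 250-255, 200-249, 100-199, or 0-99.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
STRICT_IP = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")


def is_strict_ip(candidate):
    """Return True if the whole string is four octets, each in 0-255."""
    return STRICT_IP.fullmatch(candidate) is not None


for candidate in ("192.168.1.10", "999.1.1.1", "256.0.0.1"):
    loose = LOOSE_IP.fullmatch(candidate) is not None
    print(candidate, "loose:", loose, "strict:", is_strict_ip(candidate))
```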

🟡 Medium (9)

1. What are the three regex dialects and their key escaping differences?

BRE (Basic, used by grep/sed): grouping \(\), quantifier \+, \?. ERE (Extended, used by grep -E, awk): grouping (), quantifier +, ?. PCRE (Perl, used by grep -P): same as ERE plus lookahead (?=...), (?!...), non-greedy *?, +?, and \d, \w, \s shortcuts.

Gotcha: .* is greedy by default — it matches as much as possible. Use .*? for non-greedy (lazy) matching.

Remember: "Greedy grabs all, lazy stops at first match."

2. How do you use capture groups and backreferences in sed to reformat a date?

echo "2024-03-15" | sed -E 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\3\/\2\/\1/' outputs 15/03/2024. Parentheses capture groups and \1, \2, \3 reference them. Use -E for extended regex to avoid escaping parentheses.

Remember: "Parens capture, pipe alternates." (cat|dog) matches "cat" or "dog" and captures the match.
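
Sketch: the same date reformatting expressed with Python's re.sub, which uses the same \1, \2, \3 backreference syntax in the replacement string. The function name iso_to_dmy is made up for illustration.

```python
import re


def iso_to_dmy(text):
    """Rewrite YYYY-MM-DD dates as DD/MM/YYYY using numbered backreferences."""
    # Group 1 = year, group 2 = month, group 3 = day.
    return re.sub(r"([0-9]{4})-([0-9]{2})-([0-9]{2})", r"\3/\2/\1", text)


print(iso_to_dmy("2024-03-15"))
```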

3. How do you count occurrences by key and compute a sum using awk?

Count by key: awk '{count[$1]++} END {for (k in count) print k, count[k]}'. Sum a column: awk '{sum += $3} END {print sum}'. Average: awk '{sum += $3; n++} END {print sum/n}'. Awk processes line by line, running the main block per line and END once after all input.
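
Sketch: the per-line block and END block map onto a loop followed by a final computation. This Python version assumes whitespace-delimited rows with a key in column 1 and a number in column 3; the rows are made up.

```python
from collections import Counter

# Made-up input: key in column 1, numeric value in column 3.
rows = [
    "alice GET 120",
    "bob POST 300",
    "alice GET 80",
    "carol GET 100",
]

counts = Counter()
total = 0
for row in rows:                 # awk's main block: runs once per line
    fields = row.split()
    counts[fields[0]] += 1       # count[$1]++
    total += int(fields[2])      # sum += $3

average = total / len(rows)      # awk's END block: runs once after all input

print(dict(counts), total, average)
```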

4. How do you extract all IP addresses from a log file, count unique occurrences, and show the top 20?

grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log | sort | uniq -c | sort -rn | head -20. The -o flag outputs only matching text, one per line. sort | uniq -c counts unique lines, and sort -rn orders by count descending.
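
Sketch: a rough Python equivalent of the pipeline, with findall standing in for grep -o and Counter.most_common standing in for sort | uniq -c | sort -rn | head. The top_ips helper and the sample lines are assumptions made for illustration; the order of ties may differ from the shell pipeline.

```python
import re
from collections import Counter

IP_PATTERN = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")


def top_ips(lines, limit=20):
    """Return (ip, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(IP_PATTERN.findall(line))  # every match, like grep -o
    return counts.most_common(limit)


# Made-up log lines for demonstration.
sample = [
    '10.0.0.1 - - "GET / HTTP/1.1" 200',
    '10.0.0.2 - - "GET /a HTTP/1.1" 200',
    '10.0.0.1 - - "GET /b HTTP/1.1" 404',
]
print(top_ips(sample))
```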

5. What is the difference between greedy and lazy quantifiers in regex?

Greedy quantifiers (*, +, {n,m}) match as much as possible. Lazy quantifiers (*?, +?, {n,m}?) match as little as possible.
Example: given "<b>bold</b>", the greedy pattern <.*> matches "<b>bold</b>" (entire string), while <.*?> matches "<b>" (first tag only). Lazy quantifiers require PCRE (grep -P). In BRE/ERE, use negated character classes instead: <[^>]*> matches a single tag.
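
Sketch: greedy, lazy, and negated-class matching compared side by side in Python, using a small made-up tagged string.

```python
import re

html = "<b>bold</b>"  # made-up sample string

greedy = re.search(r"<.*>", html).group()      # grabs as much as possible
lazy = re.search(r"<.*?>", html).group()       # stops at the first ">"
negated = re.search(r"<[^>]*>", html).group()  # portable alternative to lazy

print(greedy)
print(lazy)
print(negated)
```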

6. How do capture groups and backreferences work in regex?

Parentheses () create capture groups that save matched text. Backreferences (\1, \2) reuse captured text.
Example: (.)\1 matches any character followed by itself (aa, bb, 11). In sed: echo "John Smith" | sed -E 's/(\w+) (\w+)/\2, \1/' outputs "Smith, John" (\w here is a GNU sed extension; use [[:alnum:]_]+ for portability). Non-capturing groups (?:...) group without saving; they are available in PCRE only.
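
Sketch: both uses of backreferences in Python, inside the pattern (a repeated character) and inside the replacement (swapping two names). The input strings are made up.

```python
import re

# Backreference inside the pattern: any character followed by itself.
doubled = [m.group() for m in re.finditer(r"(.)\1", "aa bb 11 ab")]

# Backreferences inside the replacement: swap first and last name.
swapped = re.sub(r"(\w+) (\w+)", r"\2, \1", "John Smith")

print(doubled)
print(swapped)
```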

7. What are the practical differences between PCRE and POSIX regex in everyday use?

POSIX (grep, sed, awk): portable across all Unix, two flavors (BRE and ERE). No lookahead, no lazy quantifiers, no \d shorthand. Use [[:digit:]] instead of \d.
PCRE (grep -P, Perl, Python): richer features, including lookahead/lookbehind, lazy quantifiers, non-capturing groups, \d/\w/\s shortcuts, and named groups (?P<name>...).
Rule of thumb: use POSIX for shell scripts that must be portable, PCRE for complex extraction tasks.
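
Sketch: named groups in Python's re module, which uses the (?P<name>...) syntax. The log line format and the group names level and message are assumptions made for illustration.

```python
import re

# Hypothetical "LEVEL: message" log format.
LOG_LINE = re.compile(r"(?P<level>[A-Z]+): (?P<message>.+)")

match = LOG_LINE.match("ERROR: disk full")
level = match.group("level")        # access a group by name
fields = match.groupdict()          # all named groups as a dict

# Named groups can also be referenced in a replacement with \g<name>.
reordered = LOG_LINE.sub(r"\g<message> [\g<level>]", "ERROR: disk full")

print(level, fields, reordered)
```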

8. What is a lookahead assertion in regex and when would you use one?

A lookahead (?=...) matches a position where the pattern ahead matches, without consuming characters. Used to validate constraints (e.g., password must contain a digit) without advancing the cursor.

Gotcha: Lookaheads (?=...) and lookbehinds (?<=...) match a position, not characters. They don't consume input.

Remember: "Lookaround = peek without eating."
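
Sketch: the password use case in Python. The policy here (at least 8 characters, one digit, one uppercase letter) is a made-up example; each lookahead checks one constraint from the start of the string without consuming anything.

```python
import re

# Hypothetical policy: >= 8 characters, at least one digit, one uppercase letter.
PASSWORD = re.compile(r"^(?=.*\d)(?=.*[A-Z]).{8,}$")


def is_valid(password):
    """Return True if the password satisfies every lookahead and the length rule."""
    return PASSWORD.match(password) is not None


for candidate in ("Passw0rdX", "password1", "Sh0rt"):
    print(candidate, is_valid(candidate))
```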

9. Explain the difference between capturing groups and non-capturing groups in regex.

Capturing groups (...) store the matched text for back-references or extraction. Non-capturing groups (?:...) group without storing, which is faster when you only need grouping for alternation or quantifiers.

Remember: "Anchors: ^ = start of line, $ = end of line." They match positions, not characters.
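
Sketch: one visible consequence of the capturing versus non-capturing choice in Python. re.findall returns the group's text when the pattern has a capturing group, and the whole match when it has none. The input string is made up.

```python
import re

text = "cats dogs"  # made-up sample

# Capturing group: findall returns only what the group captured.
captured = re.findall(r"(cat|dog)s?", text)

# Non-capturing group: findall returns the full match.
full_matches = re.findall(r"(?:cat|dog)s?", text)

# The compiled pattern reports how many capturing groups it has.
group_counts = (re.compile(r"(cat|dog)s?").groups,
                re.compile(r"(?:cat|dog)s?").groups)

print(captured, full_matches, group_counts)
```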

🔴 Hard (6)

1. What are POSIX character classes and why should you use them over shortcuts like \d?

POSIX classes like [[:digit:]], [[:alpha:]], [[:alnum:]], [[:space:]] are portable across sed, awk, and grep in all locales. The \d, \w, \s shortcuts are PCRE-only (grep -P) and not available in standard sed or awk. For portable scripts, always use POSIX classes.

2. How would you use awk to find slow requests in an Apache log and compute average response time?

Slow requests (> 1 second, assuming the response time in milliseconds is in column 10): awk '$10 > 1000 {print $7, $10"ms"}' access.log. Average response time: awk '{sum += $10; n++} END {printf "avg: %.2fms\n", sum/n}' access.log. Status code distribution: awk '{print $9}' access.log | sort | uniq -c | sort -rn.

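Sketch: the same slow-request filter and average in Python, under the card's assumption that whitespace-split column 7 is the request path and column 10 is the response time in milliseconds. The helper names and sample lines are made up.

```python
def parse_times(lines):
    """Yield (path, time_ms) pairs, assuming path in column 7 and time in column 10."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 10:
            yield fields[6], float(fields[9])   # 0-based indexes for $7 and $10


def slow_requests(lines, threshold_ms=1000):
    """Return requests slower than the threshold, like awk '$10 > 1000'."""
    return [(path, ms) for path, ms in parse_times(lines) if ms > threshold_ms]


def average_ms(lines):
    """Return the mean response time, or None when there is no usable line."""
    times = [ms for _, ms in parse_times(lines)]
    return sum(times) / len(times) if times else None


# Made-up log lines with a trailing response-time column.
sample = [
    '10.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /slow HTTP/1.1" 200 1500',
    '10.0.0.2 - - [10/Oct/2024:13:55:37 +0000] "GET /fast HTTP/1.1" 200 200',
    '10.0.0.3 - - [10/Oct/2024:13:55:38 +0000] "GET /report HTTP/1.1" 500 1300',
]
print(slow_requests(sample))
print(average_ms(sample))
```

Unlike the awk one-liner, average_ms returns None instead of dividing by zero on empty input.
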
3. How do you perform operations on a range of lines between two patterns using sed?

sed '/START/,/END/s/foo/bar/g' file replaces foo with bar only on lines between START and END patterns (inclusive). You can also delete a range: sed '/START/,/END/d'. Insert before a match: sed '/pattern/i text'. Append after: sed '/pattern/a text' (these one-line i and a forms are GNU sed syntax). Combine with -n and p to extract ranges: sed -n '/START/,/END/p'.
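
Sketch: how a pattern range behaves, written as a small Python state machine. The replace_in_range helper is made up; it follows sed's behavior of opening the range on the start match, checking the end pattern only on later lines, and running to end of input if the end pattern never matches.

```python
import re


def replace_in_range(lines, start, end, old, new):
    """Apply re.sub(old, new) only on lines from a start match through an end match."""
    result = []
    inside = False
    for line in lines:
        if not inside:
            if re.search(start, line):
                inside = True                      # range opens on this line
                result.append(re.sub(old, new, line))
            else:
                result.append(line)                # outside the range: untouched
        else:
            result.append(re.sub(old, new, line))
            if re.search(end, line):
                inside = False                     # range closes after this line
    return result


sample = ["foo", "START", "foo", "END foo", "foo"]  # made-up input
print(replace_in_range(sample, r"START", r"END", r"foo", "bar"))
```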

4. What are lookahead and lookbehind assertions and when are they useful?

Lookahead (?=...) matches a position followed by a pattern without consuming it. Lookbehind (?<=...) matches a position preceded by a pattern. Negative versions: (?!...) and (?<!...). Example: \d+(?= USD) matches digits before " USD" without including " USD" in the match. Use case: extracting values adjacent to labels without including the label. Only available in PCRE (grep -P), not in sed or awk.
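
Sketch: all four lookaround forms applied to one made-up string in Python. The \b anchors in the negative examples keep the engine from matching part of a number after backtracking.

```python
import re

text = "Total: 250 USD, shipping: 15 EUR, tax: 20 USD"  # made-up sample

usd_amounts = re.findall(r"\d+(?= USD)", text)         # lookahead: followed by " USD"
tax_amount = re.findall(r"(?<=tax: )\d+", text)        # lookbehind: preceded by "tax: "
not_usd = re.findall(r"\b\d+\b(?! USD)", text)         # negative lookahead
not_tax = re.findall(r"(?<!tax: )\b\d+\b", text)       # negative lookbehind

print(usd_amounts, tax_amount, not_usd, not_tax)
```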

5. How do you use regex differently in sed, awk, and grep?

grep: filter lines matching a pattern (grep -E 'ERROR|WARN' log.txt).
sed: find and replace within lines (sed -E 's/old/new/g' file).
awk: split fields and apply logic (awk '/ERROR/ {print $1, $NF}' log.txt).
Key difference: grep selects lines, sed transforms text, awk processes structured data. Combine them: grep filters, sed cleans, awk aggregates. All three support ERE with -E (grep/sed) or natively (awk).

6. Why can certain regex patterns cause catastrophic backtracking, and how do you prevent it?

Nested quantifiers like (a+)+ on non-matching input cause exponential backtracking. Prevent by using atomic groups, possessive quantifiers, or restructuring to avoid ambiguous repetition.

Remember: "Backslash-d = digit, backslash-w = word char, backslash-s = space." Uppercase versions negate: \D = non-digit.
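
Sketch: the "restructure" fix for catastrophic backtracking, shown in Python. The nested pattern ^(a+)+$ and the flat pattern ^a+$ accept the same strings, but only the nested one has many ways to split a run of "a" characters, which a backtracking engine explores when the match fails. The input length is kept small so the demo finishes quickly; timings depend on the machine and are not shown.

```python
import re
import time

RISKY = re.compile(r"^(a+)+$")   # nested quantifiers: ambiguous repetition
SAFE = re.compile(r"^a+$")       # same set of matching strings, no ambiguity


def timed_match(pattern, text):
    """Return (matched, seconds) for a single match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


# Non-matching input: a run of "a" followed by a character that forces failure.
bad_input = "a" * 16 + "!"

for name, pattern in (("nested", RISKY), ("flat", SAFE)):
    matched, seconds = timed_match(pattern, bad_input)
    print(f"{name}: matched={matched}, {seconds:.4f}s")
```

Growing the run of "a" characters by a few at a time makes the gap between the two patterns easier to see.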