Quiz: Regex & Text Wrangling

7 questions

L0 (2 questions)

1. What is the difference between BRE (Basic Regular Expressions) and ERE (Extended Regular Expressions) when using grep?

Show answer

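In BRE (the default for grep), +, ?, |, ( and ) are literal characters; they act as operators only when written with a backslash, as in \(ab\)\+. The \( \) form is standard POSIX, while \+, \? and \| are GNU extensions that other grep implementations may not support. In ERE (grep -E), these characters are operators by default and need a backslash to match literally. ERE is generally preferred for readability. *Common mistake:* Forgetting that plain grep uses BRE, where + matches a literal plus sign rather than 'one or more'.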

2. What is the difference between .* (greedy) and .*? (lazy) in regex?

Show answer

Greedy (.*) matches as much as possible, then backtracks only as far as needed for the rest of the pattern to match. Lazy (.*?) matches as little as possible, then extends only as far as needed. Example: on '<b>text</b>', '<.*>' matches the entire string, while '<.*?>' matches just '<b>'. Use lazy quantifiers when you want the shortest match.
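The difference can be checked with a minimal Python sketch. The sample string '<b>text</b>' is an assumed illustration, and Python's re module stands in here for any Perl-style engine; lazy quantifiers are not part of POSIX BRE/ERE, so the negated-class form is shown as the portable alternative.

```python
import re

# Assumed sample string: any text with an opening and a closing tag works.
sample = "<b>text</b>"

# Greedy: .* runs to the end of the string, then backtracks to the LAST '>'.
greedy = re.search(r"<.*>", sample).group(0)

# Lazy: .*? extends one character at a time and stops at the FIRST '>'.
lazy = re.search(r"<.*?>", sample).group(0)

# Negated class: the same shortest-tag result without a lazy quantifier,
# which also works in grep -E and sed -E.
negated = re.search(r"<[^>]*>", sample).group(0)

print(greedy)   # <b>text</b>
print(lazy)     # <b>
print(negated)  # <b>
```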

L1 (3 questions)

1. You need a portable regex to match IP addresses in a log file. Why should you avoid \d and use [0-9] instead?

Show answer

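\d is a Perl-style shorthand; among grep's modes, only PCRE (grep -P) supports it. POSIX BRE and ERE do not define \d, so grep -E and sed -E cannot be relied on to treat it as a digit class. For a pattern that works across BRE, ERE and PCRE, use [0-9] or the POSIX class [[:digit:]].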
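As a rough illustration in Python (an assumption; the question itself targets grep and sed), the sketch below uses only [0-9], intervals and grouping, so the same pattern text is also valid in grep -E. The name IPV4_SHAPE is hypothetical, and the pattern checks only the dotted-quad shape, not that each octet is at most 255. The second half shows a further reason to prefer [0-9]: in Python 3, \d on str patterns also matches non-ASCII decimal digits.

```python
import re

# Only [0-9], {n,m} and grouping are used, so this pattern text also works in grep -E.
# It checks the dotted-quad shape only, not that each octet is <= 255.
IPV4_SHAPE = re.compile(r"([0-9]{1,3}\.){3}[0-9]{1,3}")

line = '192.168.1.1 - - [15/Mar/2024:14:23:01 +0000] "GET /api HTTP/1.1" 200 1234'
ip = IPV4_SHAPE.search(line).group(0)
print(ip)  # 192.168.1.1

# The digits 1, 2, 3 in Arabic-Indic script: decimal digits, but not ASCII.
arabic_indic = "\u0661\u0662\u0663"
d_matches = re.fullmatch(r"\d+", arabic_indic) is not None
class_matches = re.fullmatch(r"[0-9]+", arabic_indic) is not None
print(d_matches, class_matches)  # True False
```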

2. How do lookahead and lookbehind assertions work in regex?

Show answer

Lookahead (?=pattern) asserts what follows without consuming characters. Lookbehind (?<=pattern) asserts what precedes. Negative forms: (?!pattern) and (?<!pattern).
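The four lookaround forms can be tried out in Python's re module. This is a minimal sketch on an invented sample string. The \b anchors in the negative cases are a deliberate choice: without them, the engine can backtrack to a shorter run of digits that still satisfies the assertion.

```python
import re

# Invented sample text for illustration.
text = "price: $100, cost: 200 EUR, id: 300"

# Positive lookbehind: digits preceded by '$'; the '$' is not part of the match.
after_dollar = re.findall(r"(?<=\$)[0-9]+", text)

# Positive lookahead: digits followed by ' EUR'; ' EUR' is not consumed.
before_eur = re.findall(r"[0-9]+(?= EUR)", text)

# Negative lookahead: whole numbers NOT followed by ' EUR'.
not_eur = re.findall(r"\b[0-9]+\b(?! EUR)", text)

# Negative lookbehind: whole numbers NOT preceded by '$'.
not_dollar = re.findall(r"(?<!\$)\b[0-9]+\b", text)

print(after_dollar)  # ['100']
print(before_eur)    # ['200']
print(not_eur)       # ['100', '300']
print(not_dollar)    # ['200', '300']
```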

3. What are named capture groups and backreferences in regex?

Show answer

Named groups: (?P<name>pattern) in Python/PCRE, (?<name>pattern) in others. Reference with \k<name>, or \1 for numbered groups. Example: (?P<word>\w+)\s+(?P=word) matches repeated words like 'the the'. In sed: \(pattern\) and \1 for backreferences.
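The repeated-word example can be run as a short Python sketch. The sample sentence and the name REPEATED are invented for illustration, and the \b anchors are an addition so that only whole words are compared. Python's re module uses (?P=name) for named backreferences inside a pattern and does not accept \k<name>.

```python
import re

# Named group 'word', then whitespace, then a backreference to the same text.
REPEATED = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

# Invented sample sentence containing a doubled word.
text = "Paris in the the spring"

m = REPEATED.search(text)
doubled = m.group(0)        # the full match
word = m.group("word")      # the named capture

# The numbered form \1 does the same job, and also works in the replacement.
fixed = re.sub(r"\b(\w+)\s+\1\b", r"\1", text)

print(doubled)  # the the
print(word)     # the
print(fixed)    # Paris in the spring
```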

L2 (1 question)

1. Write a sed one-liner that extracts just the HTTP status code from an Nginx access log line like: '192.168.1.1 - - [15/Mar/2024:14:23:01 +0000] "GET /api HTTP/1.1" 200 1234'.

Show answer

sed -E 's/.* "[A-Z]+ [^ ]+ HTTP\/[0-9.]+" ([0-9]+) .*/\1/'. The capture group holds the status code that follows the quoted request line, and \1 replaces the whole line with it. *Common mistake:* Leaving the forward slash in HTTP\/ unescaped while / is the s delimiter, or using \d, which does not work as a digit class in sed.
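For comparison, the same extraction can be sketched in Python using the log line from the question. This is an illustrative equivalent, not a replacement for the sed answer. The name STATUS is hypothetical, and the {3} interval is a tightening assumed here because HTTP status codes have three digits.

```python
import re

line = '192.168.1.1 - - [15/Mar/2024:14:23:01 +0000] "GET /api HTTP/1.1" 200 1234'

# Direct translation of the sed substitution: match the whole line, keep group 1.
# No escaping of '/' is needed because Python has no s/// delimiter.
via_sub = re.sub(r'.* "[A-Z]+ [^ ]+ HTTP/[0-9.]+" ([0-9]+) .*', r"\1", line)

# Search-based variant: anchor on the quoted request line and capture 3 digits.
STATUS = re.compile(r'"[A-Z]+ [^ ]+ HTTP/[0-9.]+" ([0-9]{3}) ')
m = STATUS.search(line)
status = m.group(1) if m else None

print(via_sub)  # 200
print(status)   # 200
```

Lines that do not match behave differently in the two approaches: sed without -n prints them unchanged, while the search-based variant yields None, which makes malformed lines easier to skip.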

L3 (1 question)

1. You have a 50GB log file and need to find all lines where a request took longer than 1 second (field format: duration=N.NNNs, for example duration=0.250s). How do you approach this efficiently?

Show answer

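Use grep -E 'duration=[1-9][0-9]*\.' to match any duration of one second or more, including exactly 1.000s; the pattern needs no PCRE features, so -P is unnecessary. For an exact numeric threshold, pre-filter with grep -E 'duration=[0-9]+\.[0-9]+s' and pipe the result through awk to compare the value numerically. For 50GB files, use ripgrep (rg) for speed, or split the file and search the pieces in parallel. Avoid loading the whole file into memory.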
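The numeric-filter step can also be sketched in Python as a streaming generator. This is a minimal illustration, assuming the duration=N.NNNs field format; the function name slow_lines and the sample lines are invented. Iterating over an open file object yields one line at a time, so the whole file is never held in memory.

```python
import re
from typing import Iterable, Iterator

# Assumed field format: duration=<seconds>.<fraction>s
DURATION = re.compile(r"duration=([0-9]+\.[0-9]+)s")

def slow_lines(lines: Iterable[str], threshold: float = 1.0) -> Iterator[str]:
    """Yield lines whose duration field is strictly greater than threshold seconds."""
    for line in lines:
        m = DURATION.search(line)
        if m and float(m.group(1)) > threshold:
            yield line

# Invented sample lines; with a real file, pass the open file object instead.
sample = [
    "GET /a duration=0.250s",
    "GET /b duration=1.000s",
    "GET /c duration=1.001s",
    "GET /d duration=12.500s",
    "GET /e no timing field",
]
slow = list(slow_lines(sample))
print(slow)  # ['GET /c duration=1.001s', 'GET /d duration=12.500s']
```

Unlike the [1-9][0-9]*\. pattern, the numeric comparison excludes a duration of exactly 1.000s and allows any threshold, at the cost of being slower than grep or rg on a file of this size.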