grep & Regular Expressions - Street-Level Ops
Real-world patterns and one-liners for production troubleshooting with grep.
Finding Error Patterns in Logs
# The workhorse: find all high-severity log lines
grep -E 'ERROR|FATAL|CRIT|PANIC' /var/log/app/server.log
# Case-insensitive, with 3 lines of context after each match
grep -iE 'error|fatal|critical' -A 3 /var/log/app/server.log
# Count errors per hour from timestamped logs
grep -E 'ERROR' app.log | grep -oE '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}' | uniq -c
# Find errors in rotated and compressed logs
zgrep -c "ERROR" /var/log/app/server.log.*.gz | grep -v ':0$'
# Find the first occurrence of an error (when did it start?)
grep -n "connection pool exhausted" app.log | head -1
# Find the last occurrence (is it still happening?)
grep -n "connection pool exhausted" app.log | tail -1
> **Debug clue:** During incidents, always check both the first and the last occurrence. If the last occurrence is only minutes old, the issue is probably still ongoing. If it was hours ago, the problem may have resolved itself and you are chasing a ghost. The time between the first and last occurrence is your estimate of the incident duration.
> **Remember:** When investigating an incident, `grep -c "ERROR" app.log` is more direct than `grep "ERROR" app.log | wc -l`. The `-c` flag counts matching lines per file without an extra pipe, and combined with `-r` it gives you a per-file breakdown instantly: `grep -rc "ERROR" /var/log/app/`.
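The first/last check can be wrapped in a small helper. This is a minimal sketch, not a definitive tool: `incident_window` is a hypothetical function name, and it assumes every log line begins with a timestamp as its first space-delimited field.

```shell
# incident_window PATTERN FILE
# Prints the timestamp of the first and the last line containing PATTERN.
# Assumption: the timestamp is the first space-delimited field of each line.
incident_window() (
  pattern=$1
  file=$2
  first=$(grep -m 1 -F -- "$pattern" "$file" | cut -d' ' -f1)
  last=$(grep -F -- "$pattern" "$file" | tail -n 1 | cut -d' ' -f1)
  # No match at all: fail instead of printing empty fields
  [ -n "$first" ] || exit 1
  printf 'first=%s last=%s\n' "$first" "$last"
)
```

Called as `incident_window "connection pool exhausted" app.log`, it prints both timestamps on one line, which is easy to paste into an incident channel.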
Extracting IPs from Logs
# Extract all unique IPs from an access log
grep -oE '\b[0-9]{1,3}(\.[0-9]{1,3}){3}\b' access.log | sort -u
# Top 20 IPs by request count
grep -oE '\b[0-9]{1,3}(\.[0-9]{1,3}){3}\b' access.log | sort | uniq -c | sort -rn | head -20
# Find all requests from a specific subnet
grep -E '\b10\.0\.3\.[0-9]{1,3}\b' access.log
# Extract IPs that triggered 5xx errors (assuming combined log format)
awk '$9 ~ /^5/' access.log | grep -oE '\b[0-9]{1,3}(\.[0-9]{1,3}){3}\b' | sort | uniq -c | sort -rn
# Find IPs hitting a specific endpoint
grep 'POST /api/login' access.log | grep -oE '\b[0-9]{1,3}(\.[0-9]{1,3}){3}\b' | sort | uniq -c | sort -rn
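The IP pattern used above also accepts impossible addresses such as `999.1.1.1`, because `[0-9]{1,3}` does not check the 0-255 range. One possible way to tighten it, sketched here with a hypothetical helper named `valid_ips`, is to post-filter the matches with awk:

```shell
# valid_ips FILE...
# Extracts dotted-quad strings and keeps only those whose four octets are <= 255.
# Assumption: GNU or BSD grep (for \b and -o); -h suppresses filename prefixes.
valid_ips() {
  grep -ohE '\b[0-9]{1,3}(\.[0-9]{1,3}){3}\b' "$@" |
    awk -F. '$1 <= 255 && $2 <= 255 && $3 <= 255 && $4 <= 255'
}
```

For example, `valid_ips access.log | sort | uniq -c | sort -rn | head -20` gives the same top-20 view with malformed addresses dropped.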
Searching Codebases
# Find all TODO/FIXME/HACK comments
grep -rnE '(TODO|FIXME|HACK|XXX|WARN):?' --include='*.py' --include='*.go' .
# Find function definitions in Python
grep -rnE '^\s*def\s+\w+' --include='*.py' ./src/
# Find all imports of a specific package
grep -rn 'import requests' --include='*.py' .
grep -rn '"github.com/lib/pq"' --include='*.go' .
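# Find hardcoded passwords or secrets (audit pattern)
# Uses -P (GNU grep with PCRE): \x27 (single quote) is not a valid escape in -E. [=:] also covers YAML "key: value".
grep -rnP '(password|secret|api_key|token)\s*[=:]\s*["\x27][^"\x27]+["\x27]' --include='*.py' --include='*.yaml' --include='*.env' .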
> **One-liner:** This audit pattern catches the low-hanging fruit but misses secrets in JSON, base64-encoded strings, and non-obvious variable names. For real secret scanning, use dedicated tools like `gitleaks` or `trufflehog` that detect entropy-based patterns and known secret formats.
# Find files that reference a database host
grep -rl 'db-primary\|db-replica\|postgres://' --include='*.yaml' --include='*.env' --include='*.py' .
# Find where an environment variable is used
grep -rn 'DATABASE_URL' --include='*.py' --include='*.sh' --include='*.yaml' .
Finding Config Values
# Extract key=value pairs from config files
grep -E '^\s*[a-zA-Z_][a-zA-Z0-9_]*\s*=' /etc/app/config.ini
# Find all non-comment, non-empty lines in a config
grep -vE '^\s*(#|;|$)' /etc/nginx/nginx.conf
# Find all listen directives in nginx configs
grep -rn 'listen' /etc/nginx/sites-enabled/
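# Find all enabled systemd services
# (a plain "| grep enabled" also matches the vendor-preset column and states like enabled-runtime)
systemctl list-unit-files --type=service --state=enabled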
# Extract a specific YAML value (crude but effective)
grep -A 1 'database:' config.yaml | grep 'host:'
# Find all ports mentioned in docker-compose
grep -E 'ports:|:\s*"?[0-9]+' docker-compose.yml
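The `grep -A 1` approach to YAML only works when `host:` is the line directly after `database:`. A slightly more tolerant sketch, assuming a simple two-level YAML file with unindented section names, tracks the current top-level section in awk. The function name `yaml_section_value` is hypothetical, and this is still no substitute for a real YAML parser.

```shell
# yaml_section_value SECTION KEY FILE
# Prints the value of KEY inside the top-level block SECTION.
# Assumptions: section names are unindented, keys are indented "key: value" lines,
# and SECTION contains no regex metacharacters.
yaml_section_value() (
  section=$1
  key=$2
  file=$3
  awk -v s="$section" -v k="$key" '
    # Any unindented, non-comment line starts a new top-level section
    /^[^[:space:]#]/ { in_section = ($0 ~ ("^" s ":")) }
    in_section && $1 == (k ":") { print $2; exit }
  ' "$file"
)
```

With this, `yaml_section_value database host config.yaml` finds the host even when other keys sit between `database:` and `host:`.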
grep + awk Pipelines
# Top 10 slowest successful requests from access log (response time in last field)
# (match the status field after the quoted request; access logs do not contain the text "200 OK")
grep '" 200 ' access.log | awk '{print $NF, $7}' | sort -rn | head -10
# Average response time for a specific endpoint (guarded against zero matches)
grep 'GET /api/health' access.log | awk '{sum += $NF; n++} END {if (n) print sum/n " ms avg over " n " requests"}'
# Find requests that took more than 5 seconds (assumes the last field is in milliseconds)
grep -E 'GET|POST' access.log | awk '$NF > 5000 {print}'
# Extract and count HTTP methods
grep -oE '(GET|POST|PUT|DELETE|PATCH)' access.log | sort | uniq -c | sort -rn
# Find lines where a field exceeds a threshold
grep "cpu_usage" metrics.log | awk -F= '$2 > 90 {print}'
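Averages hide tail latency, so a percentile is often a useful companion to the pipelines above. The following is a rough sketch using the nearest-rank method; `p95` is a hypothetical helper name, and it expects one numeric value per line on stdin.

```shell
# p95: read one number per line on stdin, print the 95th percentile (nearest rank).
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit
      rank = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR) using integer arithmetic
      print v[rank]
    }'
}
```

A possible use, under the same "response time in last field" assumption as above: `grep 'GET /api/health' access.log | awk '{print $NF}' | p95`.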
Finding Processes
# Find a process (avoiding the grep-matches-itself problem)
ps aux | grep '[n]ginx'
ps aux | grep '[p]ostgres'
# Better: use pgrep
pgrep -la nginx
pgrep -f "gunicorn.*myapp"
# Find which process is listening on a port
ss -tlnp | grep ':8080'
# Find all Java processes and their arguments
ps aux | grep '[j]ava' | awk '{for(i=11;i<=NF;i++) printf "%s ", $i; print ""}'
# Find zombie processes (STAT, the 8th column of ps aux, starts with Z)
ps aux | awk '$8 ~ /^Z/'
Log Analysis One-Liners
# HTTP status code distribution from access log
awk '{print $9}' access.log | grep -E '^[0-9]{3}$' | sort | uniq -c | sort -rn
# Requests per minute for the current clock hour
grep "$(date +%d/%b/%Y:%H)" access.log | grep -oE ':[0-9]{2}:[0-9]{2}:' | sort | uniq -c
# Find all unique User-Agents
grep -oP '"[^"]*"\s*$' access.log | sort -u | head -30
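# Detect potential brute-force: IPs with >100 failed logins
# (anchor 401/403 to the status field so byte counts and URLs containing those digits are not counted)
grep -E '" (401|403) ' access.log | grep -oE '\b[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\b' | sort | uniq -c | sort -rn | awk '$1 > 100'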
# Identify slow database queries from app logs (the 20 largest query_time values)
grep -oP 'query_time=\K[0-9]+' app.log | sort -n | tail -20
# Find error bursts: lines per second with ERROR
grep "ERROR" app.log | grep -oE '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}' | uniq -c | sort -rn | head -10
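Several of these one-liners are more useful when restricted to a recent time window. For logs whose lines start with an ISO-8601 timestamp (as `app.log` is assumed to above), string comparison is enough, because ISO timestamps sort lexicographically. A minimal sketch, with `since_ts` as a hypothetical helper name:

```shell
# since_ts CUTOFF FILE
# Prints lines whose first field (an ISO-8601 timestamp) is >= CUTOFF.
# Assumption: all timestamps use the same format and timezone as CUTOFF.
since_ts() (
  cutoff=$1
  file=$2
  awk -v c="$cutoff" '$1 >= c' "$file"
)
```

With GNU `date` (an assumption; BSD `date` uses different flags), the last hour of errors could be counted as `since_ts "$(date -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" app.log | grep -c ERROR`.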
Finding Which Files Contain a String
# List files containing a pattern
grep -rl "deprecated_function" ./src/
# Count matches per file, sorted
grep -rc "WARNING" /var/log/ 2>/dev/null | grep -v ':0$' | sort -t: -k2 -rn
# Find files that do NOT contain a pattern (e.g., missing license header)
find ./src -name '*.py' -exec grep -L 'Copyright' {} +
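# Find files containing one pattern but not another
# (-Z / -0 keep filenames with spaces intact; -r, a GNU xargs flag, skips the second grep when nothing matched)
grep -rlZ "class MyService" ./src/ | xargs -0 -r grep -L "def health_check"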
> **Scale note:** On large codebases, `grep -r` can be painfully slow. `ripgrep` (`rg`) respects `.gitignore` by default, skips binary files, and can be 5-10x faster on large repos. Many ops teams reach for `rg` in interactive use and keep `grep` for scripts where portability matters.
Binary File Search
# Search binary files as if they were text
grep -a "password" core.dump
# Search for strings in a binary executable
strings /usr/bin/app | grep "version"
# Print the byte offset of each match in a binary (-b offset, -o match only, -a treat as text)
grep -aob "ERROR" firmware.bin
Recursive Search with Exclusions
# Search everything except .git, node_modules, vendor
grep -r --exclude-dir='.git' --exclude-dir='node_modules' --exclude-dir='vendor' "TODO" .
# Search only specific file types, excluding test files
grep -r --include='*.py' --exclude='*_test.py' --exclude='test_*.py' "import" ./src/
# Production-ready codebase search: skip all the noise
grep -rn \
--include='*.py' --include='*.go' --include='*.js' --include='*.ts' \
--exclude-dir='.git' --exclude-dir='node_modules' --exclude-dir='vendor' \
--exclude-dir='__pycache__' --exclude-dir='.tox' --exclude-dir='build' \
"pattern" .
Monitoring Log Files
# Follow a log file and highlight errors
tail -f /var/log/app/server.log | grep --color=always -E 'ERROR|FATAL|$'
# Follow and filter — only show errors
tail -f /var/log/app/server.log | grep -E 'ERROR|FATAL'
> **Gotcha:** When piping `tail -f` through `grep`, output may be delayed because grep buffers when its stdout is not a terminal. Use `grep --line-buffered` to force immediate output, or the data arrives in chunks instead of real-time.
# Follow multiple log files and filter
tail -f /var/log/app/*.log | grep --line-buffered "ERROR"
# Monitor for a specific event and trigger an action
tail -f /var/log/auth.log | grep --line-buffered "Failed password" | while IFS= read -r line; do
  echo "$(date): AUTH ALERT: ${line}" >> /var/log/alerts.log
done
# Watch for OOM kills in real-time
dmesg -w | grep -i "out of memory\|oom-killer\|killed process"
Multi-File Correlation
When debugging, you often need to trace a request across multiple log files:
# Find a request ID across all logs
REQUEST_ID="abc-123-def"
grep -rn "${REQUEST_ID}" /var/log/app/
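# Trace a user's activity across services
# (do not name the variable USER: that would overwrite the shell's login-name variable)
USER_EMAIL="jane@company.com"
for log in /var/log/app/{auth,api,worker}.log; do
  echo "=== ${log} ==="
  grep -F -- "${USER_EMAIL}" "${log}" | tail -5
done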
# Find correlated events within a time window
# (extract timestamps from one log, search for the exact same timestamp in another)
grep "payment_failed" payment.log | grep -oE '^[^ ]+' | while IFS= read -r ts; do
  grep -F -- "${ts}" notification.log
done
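Exact timestamp matches between two services are rare, so a looser window often works better. The sketch below truncates timestamps to the minute and feeds them to a single `grep -F -f -` call instead of one grep per timestamp. It assumes both logs start each line with an ISO-8601 timestamp (`YYYY-MM-DDTHH:MM:SS`); `correlate_by_minute` is a hypothetical name.

```shell
# correlate_by_minute PATTERN SOURCE_LOG TARGET_LOG
# Prints TARGET_LOG lines that fall in the same minute as any SOURCE_LOG line
# containing PATTERN.
# Assumption: lines begin with YYYY-MM-DDTHH:MM:SS, so the first 16 characters
# identify the minute.
correlate_by_minute() (
  pattern=$1
  source_log=$2
  target_log=$3
  grep -F -- "$pattern" "$source_log" | cut -c1-16 | sort -u |
    grep -F -f - -- "$target_log"
)
```

For the example above this would be `correlate_by_minute payment_failed payment.log notification.log`. The minute prefix is matched anywhere in the line, which is usually acceptable for a quick triage pass.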
Quick Reference: grep Flags for Ops
| Scenario | Command |
|---|---|
| Find errors in log | `grep -E 'ERROR\|FATAL' app.log` |
| Count errors per file | `grep -rc "ERROR" /var/log/` |
| Show context around match | `grep -C 5 "timeout" app.log` |
| List files containing pattern | `grep -rl "TODO" ./src/` |
| Extract matched text only | `grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' log` |
| Search recursively, skip dirs | `grep -r --exclude-dir='.git' "pat" .` |
| Case-insensitive search | `grep -i "error" app.log` |
| Fixed string (no regex) | `grep -F '$10.00' data.txt` |
| Quiet mode for scripts | `grep -q "pattern" file && echo "found"` |
| Null-delimited for xargs | `grep -rlZ "pat" . \| xargs -0 cmd` |
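In scripts, the quiet-mode row deserves one caveat: `grep` exits with 0 on a match, 1 on no match, and 2 on an error such as an unreadable file, so `&&` alone cannot tell "not found" from "could not look". A small sketch of handling all three cases, with `grep_status` as a hypothetical helper name:

```shell
# grep_status PATTERN FILE
# Prints "found", "not found", or "error" based on grep's exit status.
grep_status() (
  status=0
  grep -q -- "$1" "$2" 2>/dev/null || status=$?
  case $status in
    0) echo "found" ;;
    1) echo "not found" ;;
    *) echo "error" ;;   # exit status 2: missing file, permission denied, bad regex
  esac
)
```

The `|| status=$?` form keeps the check safe in scripts that run under `set -e`.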
Power One-Liners
Search for pattern excluding grep itself
ps aux | grep '[n]ginx'
Breakdown: `[n]ginx` matches the string "nginx", but the grep process itself appears in the `ps` output as `grep [n]ginx`, which the pattern does not match. The bracket trick replaces the classic `| grep -v grep` hack.
> **Remember:** The bracket trick works because `[n]ginx` is a regex that matches `nginx`, while the literal string `[n]ginx` shown in the process list is not matched by that regex. Elegant, but `pgrep -la nginx` is clearer wherever `pgrep` is available.

[!TIP] When to use: Finding processes by name in scripts without false matches.
Recursive search and replace (grep + sed combo)
grep -rn --include='*.py' 'old_func' . | head -5 # preview matches first
# -Z / -0 keep filenames with spaces intact; sed -i as written is GNU sed syntax
grep -rlZ --include='*.py' 'old_func' . | xargs -0 -r sed -i 's/old_func/new_func/g'
[!TIP] When to use: Codebase-wide renames, URL updates, config migrations.
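A plain substring replace also rewrites longer names such as `old_func2`. One way to avoid that, sketched here under the assumption of GNU grep, GNU sed, and GNU xargs, is to match whole words only. `rename_symbol` is a hypothetical helper, and it assumes the old and new names contain no regex metacharacters or slashes.

```shell
# rename_symbol OLD NEW DIR
# Replaces whole-word occurrences of OLD with NEW in *.py files under DIR.
# Assumptions: GNU grep/sed/xargs; OLD and NEW are plain identifiers.
rename_symbol() (
  old=$1
  new=$2
  dir=$3
  # -w limits matches to whole words; -Z and -0 keep odd filenames intact
  grep -rlZ --include='*.py' -w -- "$old" "$dir" |
    xargs -0 -r sed -i "s/\\b${old}\\b/${new}/g"
)
```

Running it on a clean working tree makes the result easy to review with `git diff` before committing.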