grep & Regular Expressions - Street-Level Ops

Real-world patterns and one-liners for production troubleshooting with grep.

Finding Error Patterns in Logs

# The workhorse: find all high-severity log lines
grep -E 'ERROR|FATAL|CRIT|PANIC' /var/log/app/server.log

# Case-insensitive, with 3 lines of context after each match
grep -iE 'error|fatal|critical' -A 3 /var/log/app/server.log

# Count errors per hour from timestamped logs
grep -E 'ERROR' app.log | grep -oE '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}' | uniq -c

# Find errors in rotated and compressed logs
zgrep -c "ERROR" /var/log/app/server.log.*.gz | grep -v ':0$'

# Find the first occurrence of an error (when did it start?)
grep -n "connection pool exhausted" app.log | head -1

# Find the last occurrence (is it still happening?)
grep -n "connection pool exhausted" app.log | tail -1

> **Debug clue:** During incidents, always check both first and last occurrence. If the first and last are minutes apart, the issue is ongoing. If the last was hours ago, it may have self-resolved and you are chasing a ghost. The time delta between first and last is your incident duration estimate.
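
The first/last check can be wrapped into one helper. This is a minimal sketch, assuming each log line starts with an ISO-8601 timestamp as its first space-separated field and that GNU `date -d` is available; the function name `incident_window` is hypothetical.

```shell
# incident_window PATTERN FILE
# Prints the first and last timestamp of lines containing PATTERN (literal
# string) and the number of seconds between them.
# Assumptions: timestamp is the first space-separated field; GNU date.
incident_window() (
    first=$(grep -F -- "$1" "$2" | head -n 1 | cut -d' ' -f1)
    last=$(grep -F -- "$1" "$2" | tail -n 1 | cut -d' ' -f1)
    if [ -z "$first" ]; then
        echo "no matches"
        exit 1
    fi
    start=$(date -d "$first" +%s) || exit 2
    end=$(date -d "$last" +%s) || exit 2
    echo "first=$first last=$last duration=$((end - start))s"
)
```

Called as `incident_window "connection pool exhausted" app.log`, it prints both timestamps and the elapsed seconds on one line.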

Remember: During an incident, grep -c "ERROR" app.log is quicker to type and cheaper to run than grep "ERROR" app.log | wc -l because it avoids the extra process. Note that -c counts matching lines, not individual matches. Combined with -r it gives a per-file breakdown in one step: grep -rc "ERROR" /var/log/app/.
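
To see the difference between counting lines and counting occurrences, here is a small sketch on a throwaway file. The sample contents are made up for illustration.

```shell
# Build a throwaway sample; the contents are hypothetical.
sample=$(mktemp)
printf '%s\n' 'ERROR a; ERROR b' 'INFO ok' 'ERROR c' > "$sample"

# -c counts matching LINES: two lines contain ERROR.
lines=$(grep -c 'ERROR' "$sample")

# -o prints every match on its own line, so wc -l counts OCCURRENCES.
# tr strips the padding some wc implementations add.
occurrences=$(grep -o 'ERROR' "$sample" | wc -l | tr -d ' ')

echo "lines=$lines occurrences=$occurrences"
rm -f "$sample"
```

When a single log line can carry several errors, the `-o | wc -l` form is the one that reflects the true count.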

Extracting IPs from Logs

# Extract all unique IPs from an access log
grep -oE '\b[0-9]{1,3}(\.[0-9]{1,3}){3}\b' access.log | sort -u

# Top 20 IPs by request count
grep -oE '\b[0-9]{1,3}(\.[0-9]{1,3}){3}\b' access.log | sort | uniq -c | sort -rn | head -20

# Find all requests from a specific subnet
grep -E '\b10\.0\.3\.[0-9]{1,3}\b' access.log

# Count client IPs that triggered 5xx errors (assuming combined log format: $1 = client IP, $9 = status)
awk '$9 ~ /^5/ {print $1}' access.log | sort | uniq -c | sort -rn

# Find IPs hitting a specific endpoint
grep 'POST /api/login' access.log | grep -oE '\b[0-9]{1,3}(\.[0-9]{1,3}){3}\b' | sort | uniq -c | sort -rn

Searching Codebases

# Find all TODO/FIXME/HACK comments
grep -rnE '(TODO|FIXME|HACK|XXX|WARN):?' --include='*.py' --include='*.go' .

# Find function definitions in Python
grep -rnE '^\s*def\s+\w+' --include='*.py' ./src/

# Find all imports of a specific package
grep -rn 'import requests' --include='*.py' .
grep -rn '"github.com/lib/pq"' --include='*.go' .

# Find hardcoded passwords or secrets (audit pattern; -P is required for the \x27 single-quote escape)
grep -rnP '(password|secret|api_key|token)\s*=\s*["\x27][^"\x27]+["\x27]' --include='*.py' --include='*.yaml' --include='*.env' .

> **One-liner:** This audit pattern catches the low-hanging fruit but misses secrets in JSON, base64-encoded strings, and non-obvious variable names. For real secret scanning, use dedicated tools like `gitleaks` or `trufflehog` that look for high-entropy strings and known secret formats.

# Find files that reference a database host
grep -rl 'db-primary\|db-replica\|postgres://' --include='*.yaml' --include='*.env' --include='*.py' .

# Find where an environment variable is used
grep -rn 'DATABASE_URL' --include='*.py' --include='*.sh' --include='*.yaml' .

Finding Config Values

# Extract key=value pairs from config files
grep -E '^\s*[a-zA-Z_][a-zA-Z0-9_]*\s*=' /etc/app/config.ini

# Find all non-comment, non-empty lines in a config
grep -vE '^\s*(#|;|$)' /etc/nginx/nginx.conf

# Find all listen directives in nginx configs
grep -rn 'listen' /etc/nginx/sites-enabled/

# Find all enabled systemd services
systemctl list-unit-files | grep enabled

# Extract a specific YAML value (crude but effective)
grep -A 1 'database:' config.yaml | grep 'host:'

# Find all ports mentioned in docker-compose
grep -E 'ports:|:\s*"?[0-9]+' docker-compose.yml

grep + awk Pipelines

# Top 10 slowest successful requests (combined log format, response time in ms appended as the last field)
grep '" 200 ' access.log | awk '{print $NF, $7}' | sort -rn | head -10

# Average response time for a specific endpoint
grep 'GET /api/health' access.log | awk '{sum += $NF; n++} END {if (n) print sum/n " ms avg over " n " requests"}'

# Find requests that took more than 5 seconds (5000 ms)
grep -E 'GET|POST' access.log | awk '$NF > 5000 {print}'

# Extract and count HTTP methods
grep -oE '(GET|POST|PUT|DELETE|PATCH)' access.log | sort | uniq -c | sort -rn

# Find lines where a field exceeds a threshold
grep "cpu_usage" metrics.log | awk -F= '$2 > 90 {print}'

Finding Processes

# Find a process (avoiding the grep-matches-itself problem)
ps aux | grep '[n]ginx'
ps aux | grep '[p]ostgres'

# Better: use pgrep
pgrep -la nginx
pgrep -f "gunicorn.*myapp"

# Find which process is listening on a port
ss -tlnp | grep ':8080'

# Find all Java processes and their arguments
ps aux | grep '[j]ava' | awk '{for(i=11;i<=NF;i++) printf "%s ", $i; print ""}'

# Find zombie processes (STAT, the 8th column of ps aux, starts with Z)
ps aux | awk '$8 ~ /^Z/'

Log Analysis One-Liners

# HTTP status code distribution from access log
awk '{print $9}' access.log | grep -E '^[0-9]{3}$' | sort | uniq -c | sort -rn

# Requests per minute during the current clock hour
grep "$(date +%d/%b/%Y:%H)" access.log | grep -oE ':[0-9]{2}:[0-9]{2}:' | sort | uniq -c

# Find all unique User-Agents
grep -oP '"[^"]*"\s*$' access.log | sort -u | head -30

# Detect potential brute-force: IPs with >100 responses of status 401 or 403 (combined log format)
grep -E '" (401|403) ' access.log | awk '{print $1}' | sort | uniq -c | sort -rn | awk '$1 > 100'

# Identify slow database queries from app logs
grep -E 'query_time=[0-9]+' app.log | grep -oP 'query_time=\K[0-9]+' | sort -n | tail -20

# Find error bursts: lines per second with ERROR
grep "ERROR" app.log | grep -oE '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}' | uniq -c | sort -rn | head -10

Finding Which Files Contain a String

# List files containing a pattern
grep -rl "deprecated_function" ./src/

# Count matches per file, sorted
grep -rc "WARNING" /var/log/ 2>/dev/null | grep -v ':0$' | sort -t: -k2 -rn

# Find files that do NOT contain a pattern (e.g., missing license header)
find ./src -name '*.py' -exec grep -L 'Copyright' {} +

# Find files containing one pattern but not another
grep -rlZ "class MyService" ./src/ | xargs -0 -r grep -L "def health_check"

Scale note: On large codebases, grep -r can be painfully slow. ripgrep (rg) respects .gitignore by default, skips hidden and binary files, and is often several times faster on large repositories. A common setup is rg for interactive use and grep in scripts where portability matters.

Searching Binary Files

# Search binary files as if they were text
grep -a "password" core.dump

# Search for strings in a binary executable
strings /usr/bin/app | grep "version"

# Print the byte offset of every match in a binary
grep -aob "ERROR" firmware.bin

Recursive Search with Exclusions

# Search everything except .git, node_modules, vendor
grep -r --exclude-dir='.git' --exclude-dir='node_modules' --exclude-dir='vendor' "TODO" .

# Search only specific file types, excluding test files
grep -r --include='*.py' --exclude='*_test.py' --exclude='test_*.py' "import" ./src/

# Production-ready codebase search: skip all the noise
grep -rn \
    --include='*.py' --include='*.go' --include='*.js' --include='*.ts' \
    --exclude-dir='.git' --exclude-dir='node_modules' --exclude-dir='vendor' \
    --exclude-dir='__pycache__' --exclude-dir='.tox' --exclude-dir='build' \
    "pattern" .

Monitoring Log Files

# Follow a log file and highlight errors
tail -f /var/log/app/server.log | grep --color=always -E 'ERROR|FATAL|$'

# Follow and filter — only show errors
tail -f /var/log/app/server.log | grep -E 'ERROR|FATAL'

> **Gotcha:** grep switches to block buffering when its stdout is not a terminal, so a pipeline such as `tail -f ... | grep ... | another-command` (or a redirect to a file) delivers matches in delayed chunks. Add `grep --line-buffered` to force immediate output.

# Follow multiple log files and filter
tail -f /var/log/app/*.log | grep --line-buffered "ERROR"

# Monitor for a specific event and trigger an action
tail -f /var/log/auth.log | grep --line-buffered "Failed password" | while read -r line; do
    echo "$(date): AUTH ALERT — ${line}" >> /var/log/alerts.log
done

# Watch for OOM kills in real-time
dmesg -w | grep -i "out of memory\|oom-killer\|killed process"

Multi-File Correlation

When debugging, you often need to trace a request across multiple log files:

# Find a request ID across all logs
REQUEST_ID="abc-123-def"
grep -rn "${REQUEST_ID}" /var/log/app/

# Trace a user's activity across services
USER_EMAIL="jane@company.com"   # not USER: the shell already sets that to your login name
for log in /var/log/app/{auth,api,worker}.log; do
    echo "=== ${log} ==="
    grep -F "${USER_EMAIL}" "${log}" | tail -5
done

# Find correlated events that share an exact timestamp
# (extract timestamps from one log, search for each in another; -F treats them as literal strings)
grep "payment_failed" payment.log | grep -oE '^[^ ]+' | while read -r ts; do
    grep -F -- "${ts}" notification.log
done
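
Exact timestamps rarely line up across services, so a looser match is often more useful. The sketch below correlates on the minute instead. It assumes both logs start each line with a timestamp such as 2024-01-15T10:03:27, so the first 16 characters identify the minute; the helper name `correlate_by_minute` is hypothetical.

```shell
# correlate_by_minute PATTERN SOURCE_LOG TARGET_LOG
# Prints TARGET_LOG lines that contain the minute prefix of any SOURCE_LOG
# line matching PATTERN. The prefix is matched anywhere in the line, which
# is good enough when timestamps only appear at the start.
correlate_by_minute() (
    minutes=$(mktemp) || exit 1
    # Collect the distinct YYYY-MM-DDTHH:MM prefixes of matching lines.
    grep -F -- "$1" "$2" | cut -c1-16 | sort -u > "$minutes"
    status=1
    if [ -s "$minutes" ]; then
        # -F: literal strings; -f: read one pattern per line from the file.
        grep -F -f "$minutes" -- "$3"
        status=$?
    fi
    rm -f "$minutes"
    exit "$status"
)
```

For example, `correlate_by_minute payment_failed payment.log notification.log` lists notifications logged in the same minute as any failed payment. Feeding all prefixes to one `grep -f` also avoids rescanning the target log once per timestamp.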

Quick Reference: grep Flags for Ops

Scenario                        Command
Find errors in log              grep -E 'ERROR|FATAL' app.log
Count errors per file           grep -rc "ERROR" /var/log/
Show context around match       grep -C 5 "timeout" app.log
List files containing pattern   grep -rl "TODO" ./src/
Extract matched text only       grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' log
Search recursively, skip dirs   grep -r --exclude-dir='.git' "pat" .
Case-insensitive search         grep -i "error" app.log
Fixed string (no regex)         grep -F '$10.00' data.txt
Quiet mode for scripts          grep -q "pattern" file && echo "found"
Null-delimited for xargs        grep -rlZ "pat" . | xargs -0 cmd
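
The quiet-mode row relies on grep's exit status: 0 when a line matched, 1 when nothing matched, and 2 on an error such as an unreadable file. Here is a minimal sketch of a script check that keeps those three cases apart; the function name `check_pattern` is hypothetical.

```shell
# check_pattern PATTERN FILE
# Reports whether FILE contains PATTERN, distinguishing "no match" from
# "could not read the file".
check_pattern() {
    # -q: print nothing and stop at the first match
    # -s: suppress error messages about missing or unreadable files
    grep -qs -- "$1" "$2"
    case $? in
        0) echo "found" ;;
        1) echo "not found" ;;
        *) echo "error"; return 2 ;;
    esac
}
```

Treating status 2 separately matters in alerting scripts, where a missing log file should not be read as "no errors".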

Power One-Liners

Search for pattern excluding grep itself

ps aux | grep '[n]ginx'

Breakdown: [n]ginx is a regex whose bracket expression matches the single character n, so it matches the text nginx. The grep process itself appears in ps output as grep [n]ginx, and that literal string does not contain nginx, so grep never matches its own entry. This replaces the classic | grep -v grep hack.

Remember: The trick is elegant, but pgrep -la nginx is clearer wherever pgrep is available.
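
The effect can be reproduced without a live process list. The sketch below runs the same regex over simulated `ps` output; the two sample lines are made up for illustration.

```shell
# Simulated ps output: one real nginx process, plus the grep process whose
# command line contains the literal text "[n]ginx".
ps_sample='root  101  nginx: master process
ops   202  grep [n]ginx'

# The regex [n]ginx matches the text "nginx" only. In the second line the
# "n" is followed by "]" rather than "ginx", so that line is skipped.
matched=$(printf '%s\n' "$ps_sample" | grep '[n]ginx')
echo "$matched"
```

Only the master-process line is printed, which is the behaviour the bracket trick depends on.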

When to use: Finding processes by name in scripts without false matches.

Recursive search and replace (grep + sed combo)

grep -rn --include='*.py' 'old_func' . | head -5   # preview matches first
grep -rlZ --include='*.py' 'old_func' . | xargs -0 -r sed -i 's/old_func/new_func/g'   # -Z/-0 handle odd filenames; -r skips sed if nothing matched; sed -i as in GNU sed

When to use: Codebase-wide renames, URL updates, config migrations.
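
For repeated use, the rename can be wrapped in a small function. This is a sketch under stated assumptions: GNU grep, xargs, and sed are available, and the old and new names contain no regex metacharacters or `/`, because they are pasted directly into the sed expression. The name `rename_symbol` is hypothetical.

```shell
# rename_symbol OLD NEW DIR
# Replaces OLD with NEW in every *.py file under DIR that contains OLD.
# Files without a match are never touched.
rename_symbol() {
    # -l: list matching files only; -Z: terminate each name with a NUL byte
    # xargs -0: split on NUL, so names with spaces survive
    # xargs -r: do not run sed at all when grep found nothing (GNU xargs)
    grep -rlZ --include='*.py' -- "$1" "$3" |
        xargs -0 -r sed -i "s/$1/$2/g"
}
```

Running it in a clean git working tree, for example `rename_symbol old_func new_func ./src`, makes the result easy to review with `git diff` and easy to revert.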