Legacy System Archaeology - Primer¶
Why This Matters¶
Every ops engineer's first day at a new job starts the same way: you're handed a system nobody fully understands, with documentation that's two years stale, and told "don't break anything." Legacy system archaeology is the skill of building a mental model of an inherited system — without the benefit of the people who built it, the original design docs, or accurate architecture diagrams.
This is not a niche skill. It's the first thing you need at every new job, every acquisition, every team transfer, and every on-call rotation for a service you didn't build. The ability to reverse-engineer an existing system from its artifacts — configs, cron jobs, logs, network connections, and running processes — is what separates someone who's dangerous from someone who's useful.
Core Concepts¶
1. The Archaeological Approach¶
Archaeology Mindset:
You are not here to judge. You are here to understand.
The previous team was not stupid. They were:
- Working with different constraints
- Solving different problems
- Under different time pressures
- Using the tools available at the time
Your job:
1. Observe what exists (don't change anything yet)
2. Understand why it exists (context, not just configuration)
3. Document what you find (for the next archaeologist)
4. Then — and only then — consider changes
2. The First Day Survey¶
When you inherit a system, run this survey before touching anything:
# Who am I, where am I, what is this?
hostname && uname -a && cat /etc/os-release
# How long has this system been running?
uptime
who -b # Last boot time
# What's the system's purpose? (infer from running services)
systemctl list-units --type=service --state=running | grep -v systemd
# What's listening on the network?
ss -tlnp # TCP listeners with process names
ss -ulnp # UDP listeners
# What's talking to what?
ss -tnp | awk '{print $5}' | sort | uniq -c | sort -rn | head -20
# What packages are installed? (what was this built to do?)
rpm -qa --qf '%{NAME}\n' 2>/dev/null | sort | head -50
dpkg -l 2>/dev/null | awk '/^ii/{print $2}' | head -50
# What cron jobs exist?
for user in $(cut -f1 -d: /etc/passwd); do
crontab -l -u "$user" 2>/dev/null | grep -v '^#' | grep -v '^$' && echo " ^^^ ($user)"
done
ls -la /etc/cron.d/ /etc/cron.daily/ /etc/cron.hourly/ /etc/cron.weekly/ 2>/dev/null
# What in /etc was modified in the last year? (rough proxy for customization)
find /etc -mtime -365 -type f 2>/dev/null | head -50
# What's using the most disk space?
du -sh /* 2>/dev/null | sort -rh | head -10
# What users exist beyond system accounts?
awk -F: '$3 >= 1000 {print $1, $3, $7}' /etc/passwd
3. Reading Configs You Didn't Write¶
The Config Reading Protocol:
Step 1: Find the REAL config
- The config in git may not match production
- The config in /etc may not be the one the process actually reads
- Check the process command line for config file paths:
ps aux | grep <process>
cat /proc/<PID>/cmdline | tr '\0' ' '
Step 2: Identify what was customized vs. default
- Compare against package defaults:
rpm -qf /etc/nginx/nginx.conf # What package owns this file?
rpm -V nginx # What changed from defaults?
dpkg -V nginx # Debian equivalent
- Or diff against a fresh install of the same version
Step 3: Read for intent, not just syntax
- Comments are gold (read every one)
- Commented-out settings tell you what was tried and abandoned
- Look for "TODO" and "HACK" and "FIXME" comments
- Date-stamped comments reveal the history of changes
Step 4: Map config interdependencies
- Config file A references file B (includes, imports)
- Environment variables referenced in configs
- Templates vs. rendered configs (is this file generated?)
# Find config-like files a process holds open (pgrep -o: oldest match, i.e. the master)
ls -la /proc/$(pgrep -o nginx)/fd 2>/dev/null | grep -E '\.(conf|cfg|ini|yml|yaml|json)'
# Watch which config files get opened while attached (e.g. during a reload)
strace -f -e openat -p $(pgrep -o nginx) 2>&1 | head -50
# Or trace a config test, which opens the same files without restarting:
strace -f -e openat nginx -t 2>&1 | grep -v ENOENT
# Find all config files that reference a specific value
grep -r "database_host\|DB_HOST\|PGHOST" /etc/ /opt/ 2>/dev/null
# Compare running config against on-disk config (for services that support it)
nginx -T 2>/dev/null | head -50 # Running nginx config
postconf -n 2>/dev/null # Running Postfix config (non-defaults only)
sshd -T 2>/dev/null # Running SSH config
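Step 4 above (mapping config interdependencies) can also be done mechanically. A minimal sketch, assuming GNU grep; the helper names `list_includes` and `list_env_refs` are hypothetical, and `/etc/nginx` stands in for whatever tree your process actually reads:

```shell
# Sketch: surface config interdependencies in a directory tree.
list_includes() {
  # include/import/source lines reveal which files pull in which others
  grep -rnE '^[[:space:]]*(include|import|source)[[:space:]]' "$1" 2>/dev/null
}

list_env_refs() {
  # ${VAR}-style references reveal values injected from the environment
  grep -rhoE '\$\{?[A-Z_][A-Z0-9_]*\}?' "$1" 2>/dev/null | sort -u
}

# Usage (point at the tree the process actually reads):
#   list_includes /etc/nginx
#   list_env_refs /etc/nginx
```

Any file that shows up in `list_includes` but not in git, or any variable in `list_env_refs` with no obvious source, is a thread worth pulling.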
4. Tracing Dependencies Without Documentation¶
Dependency Discovery Methods:
Method 1: Network connections
ss -tnp → shows every TCP connection with process names
→ Your service connects to: database, cache, queue, other services
→ Map these connections to hostnames/IPs
Method 2: DNS lookups
tcpdump -i any -n port 53 → shows all DNS queries
→ What hostnames does the system resolve?
→ These are its dependencies
Method 3: Filesystem reads
lsof -p <PID> → shows all open files
→ Config files, log files, data files, sockets, libraries
→ Each one is a dependency or artifact
Method 4: Environment variables
cat /proc/<PID>/environ | tr '\0' '\n'
→ Database URLs, API endpoints, feature flags
→ These reveal the integration points
Method 5: systemd unit files
systemctl cat <service> → shows the unit file
→ After=, Requires=, Wants= reveal ordered dependencies
→ ExecStartPre= reveals setup steps
→ Environment= and EnvironmentFile= reveal config sources
# Build a dependency map from network connections
ss -Htnp | awk '{print $4, $5}' | sort -u | while read -r local remote; do
    # Split on the last colon so IPv6 addresses survive intact
    remote_ip=${remote%:*}
    remote_port=${remote##*:}
    # Try to resolve the remote IP to a hostname
    hostname=$(getent hosts "$remote_ip" 2>/dev/null | awk '{print $2}')
    echo "$local → ${hostname:-$remote_ip}:$remote_port"
done
# Map systemd dependencies
systemd-analyze dot --to-pattern='*.service' 2>/dev/null | head -50
# Check what environment variables the service sees
systemctl show <service> -p Environment -p EnvironmentFile
# Find upstream dependencies from application config
grep -rh "host\|url\|endpoint\|server\|addr" /etc/myapp/ 2>/dev/null | grep -v '^#'
5. Understanding Cron Jobs and Timers¶
Cron Archaeology:
Cron jobs are the tribal knowledge of a system.
Nobody documents them. Nobody remembers adding them.
They run silently until they break — then they're everyone's emergency.
# Inventory ALL scheduled tasks
echo "=== System crontabs ==="
cat /etc/crontab 2>/dev/null
echo "=== Cron directories ==="
for dir in /etc/cron.d /etc/cron.daily /etc/cron.hourly /etc/cron.weekly /etc/cron.monthly; do
echo "--- $dir ---"
ls -la "$dir" 2>/dev/null
done
echo "=== User crontabs ==="
for user in $(cut -f1 -d: /etc/passwd); do
crons=$(crontab -l -u "$user" 2>/dev/null | grep -v '^#' | grep -v '^$')
if [ -n "$crons" ]; then
echo "--- $user ---"
echo "$crons"
fi
done
echo "=== Systemd timers ==="
systemctl list-timers --all --no-pager
echo "=== At jobs ==="
atq 2>/dev/null
For each cron job you find, answer:
Cron Job Analysis Template:
Schedule: [when does it run?]
Command: [what does it execute?]
User: [who does it run as?]
Output: [where does output go? /dev/null? Email? Log file?]
Purpose: [what problem does this solve?]
Dependencies: [what does it need to work? Network? Database? Disk space?]
Failure mode: [what happens if it fails? Silent? Alert? Data corruption?]
Last run: [is it currently working?]
Owner: [who added this? (check git blame, timestamps, comments)]
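The "Last run" question is usually answerable from cron's own logs: cron writes a line each time it starts a job. The log location varies by distro (/var/log/syslog on Debian/Ubuntu, /var/log/cron on RHEL, or `journalctl -u cron`). A small sketch, with the log path as an optional parameter; `backup.sh` is a hypothetical job name:

```shell
# Sketch: show the last few logged executions of a cron job.
last_cron_run() {
  job="$1"; shift
  # Default to the common syslog locations if no log file is given
  [ "$#" -eq 0 ] && set -- /var/log/syslog /var/log/cron
  grep -hF "$job" "$@" 2>/dev/null | tail -5
}

# Usage: last_cron_run backup.sh
```

No output at all is itself a finding: the job either never runs, or logs somewhere you haven't found yet.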
6. Finding the "Real" Config vs. What's in Git¶
The Config Drift Problem:
Git says: max_connections = 100
Server says: max_connections = 500
Documentation: max_connections = 200
Which is real? The server is real. Always.
Git is what someone intended.
The server is what's actually running.
# Compare deployed config to git
# Step 1: Find what's on the server
cat /etc/myapp/config.yml
# Step 2: Find what's in git
git -C /path/to/repo show HEAD:config/config.yml
# Step 3: Diff them
diff <(cat /etc/myapp/config.yml) <(git -C /path/to/repo show HEAD:config/config.yml)
# Common reasons for drift:
# 1. Manual hotfix applied during an incident, never committed
# 2. Config management tool (Ansible/Puppet) running a different branch
# 3. Environment variable overrides not reflected in file
# 4. Config file generated by a template with stale variables
# 5. Multiple config files in different locations, wrong one in git
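The same check scales to a whole config directory. A sketch, assuming you have a local clone of the repo; the function name `check_drift` and all three paths in the usage line are placeholders for your layout:

```shell
# Sketch: flag every deployed config file that drifted from git,
# or that exists on the server but not in the repo at all.
check_drift() {
  repo="$1"; subdir="$2"; deployed="$3"
  for f in "$deployed"/*; do
    [ -f "$f" ] || continue
    name=$(basename "$f")
    if git -C "$repo" cat-file -e "HEAD:$subdir/$name" 2>/dev/null; then
      # Flag files whose deployed copy differs from the committed copy
      if ! git -C "$repo" show "HEAD:$subdir/$name" 2>/dev/null | diff -q - "$f" >/dev/null; then
        echo "DRIFT: $name"
      fi
    else
      echo "NOT IN GIT: $name (hotfix? generated? orphan?)"
    fi
  done
}

# Usage: check_drift /path/to/repo config /etc/myapp
```

Every "NOT IN GIT" line maps to one of the drift reasons listed above; each deserves its own investigation before you trust the repo.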
7. Identifying Tribal Knowledge¶
Signs of Tribal Knowledge:
1. The README says "ask Dave about the deployment process"
→ Dave left 2 years ago
2. A script references /home/jsmith/scripts/deploy.sh
→ jsmith's home directory is the canonical source
3. A cron job runs a binary with no source code in the repo
→ Compiled from somewhere, nobody knows where
4. The monitoring dashboard was built by someone who left
→ Nobody knows what the queries mean or why the thresholds are set
5. "We always restart it on the first Monday of the month"
→ Nobody knows why, but bad things happen if you don't
Discovery technique:
Grep for names in configs, comments, and git logs:
git log --all --format='%an' | sort | uniq -c | sort -rn
→ Who wrote the most code? Are they still here?
→ If not, their commits are your primary documentation
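Sign #2 above can also be hunted mechanically: grep the places where jobs and scripts live for references to personal home directories. A sketch; the search paths in the usage line are assumptions, so extend them to wherever your jobs actually live:

```shell
# Sketch: find scripts and configs that reference /home/<user> paths --
# a classic tribal-knowledge smell.
scan_home_refs() {
  grep -rnE '/home/[a-z_][a-z0-9_-]*' "$@" 2>/dev/null
}

# Usage: scan_home_refs /etc/crontab /etc/cron.d /usr/local/bin
```

Every hit is a dependency on a person rather than on the system; copy the referenced file somewhere version-controlled before that home directory disappears.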
8. Building a Mental Model¶
The System Map:
Draw this as you discover it:
┌───────────────────┐      ┌──────────────────┐
│   Load Balancer   │─────▶│ Web Server (x3)  │
│ (nginx, port 443) │      │ (gunicorn :8000) │
└───────────────────┘      └────────┬─────────┘
                                    │
                           ┌────────▼────────┐
                           │   App Server    │
                           │ (Python/Flask)  │
                           └───┬────┬────┬───┘
                               │    │    │
                     ┌─────────┘    │    └─────────┐
                     ▼              ▼              ▼
               ┌────────────┐ ┌────────────┐ ┌────────────┐
               │ PostgreSQL │ │   Redis    │ │     S3     │
               │ (primary)  │ │  (cache)   │ │  (assets)  │
               └────────────┘ └────────────┘ └────────────┘
For each component, note:
- What host(s) it runs on
- What port(s) it listens on
- What it depends on
- What depends on it
- Where the config lives
- Where the logs go
- How it's deployed
- Who owns it (if anyone)
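One way to make sure these notes actually get written: a stub generator that turns the checklist into a per-component template as you discover things. The function name, field wording, and output file are all up to you; this is just one sketch:

```shell
# Sketch: emit a notes stub for one component of the system map.
component_stub() {
  cat <<EOF
## $1
- Hosts:
- Ports:
- Depends on:
- Depended on by:
- Config lives at:
- Logs go to:
- Deployed via:
- Owner:
EOF
}

# Usage: component_stub "PostgreSQL (primary)" >> SYSTEM-MAP.md
```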
Common Pitfalls¶
- Changing things before understanding them — The urge to "clean up" legacy code on day one is strong and dangerous. Observe for at least two weeks before making non-emergency changes.
- Trusting the documentation — Documentation describes intent, not reality. Always verify against the running system; diff the config on disk against what's in git.
- Assuming it's badly designed — What looks like a hack may be a brilliant workaround for a constraint you don't know about. Assume competence until proven otherwise.
- Ignoring cron jobs — Cron jobs are the dark matter of infrastructure. They hold critical processes together and nobody documents them. Inventory them on day one.
- Not documenting what you find — You're doing archaeology. If you don't write it down, the next person will have to rediscover everything you learned. Be the archaeologist who publishes.
- Deleting things that "aren't used" — If you don't understand what something does, you can't know it's unused. Disable before deleting. Wait a month. Then delete.
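The disable-before-deleting pattern for a suspect cron entry can be as simple as commenting the line out with a dated marker, so it can be restored instantly if something breaks. A sketch using GNU sed; the helper name, file, and job name are hypothetical:

```shell
# Sketch: disable a cron line in place instead of deleting it.
disable_cron_line() {
  file="$1" pattern="$2"
  [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
  # Prefix matching, still-active lines with a dated DISABLED marker
  sed -i "s|^\([^#].*$pattern.*\)|# DISABLED $(date +%F): \1|" "$file"
}

# Usage: disable_cron_line /etc/cron.d/mystery-job nightly-sync.sh
```

The date stamp makes "wait a month" measurable, and the original line survives verbatim for instant rollback.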
Wiki Navigation¶
Prerequisites¶
- Linux Ops (Topic Pack, L0)
Related Content¶
- Legacy Systems Flashcards (CLI) (flashcard_deck, L1) — Legacy System Archaeology