Portal | Level: L1: Foundations | Topics: Legacy System Archaeology | Domain: DevOps & Tooling

Legacy System Archaeology - Primer

Why This Matters

Every ops engineer's first day at a new job starts the same way: you're handed a system nobody fully understands, with documentation that's two years stale, and told "don't break anything." Legacy system archaeology is the skill of building a mental model of an inherited system — without the benefit of the people who built it, the original design docs, or accurate architecture diagrams.

This is not a niche skill. It's the first thing you need at every new job, every acquisition, every team transfer, and every on-call rotation for a service you didn't build. The ability to reverse-engineer an existing system from its artifacts — configs, cron jobs, logs, network connections, and running processes — is what separates someone who's dangerous from someone who's useful.

Core Concepts

1. The Archaeological Approach

Archaeology Mindset:

  You are not here to judge. You are here to understand.

  The previous team was not stupid. They were:
  - Working with different constraints
  - Solving different problems
  - Under different time pressures
  - Using the tools available at the time

  Your job:
  1. Observe what exists (don't change anything yet)
  2. Understand why it exists (context, not just configuration)
  3. Document what you find (for the next archaeologist)
  4. Then — and only then — consider changes

2. The First Day Survey

When you inherit a system, run this survey before touching anything:

# Who am I, where am I, what is this?
hostname && uname -a && cat /etc/os-release

# How long has this system been running?
uptime
who -b                                # Last boot time

# What's the system's purpose? (infer from running services)
systemctl list-units --type=service --state=running | grep -v systemd

# What's listening on the network?
ss -tlnp                              # TCP listeners with process names
ss -ulnp                              # UDP listeners

# What's talking to what?
ss -tnp | awk 'NR>1 {print $5}' | sort | uniq -c | sort -rn | head -20

# What packages are installed? (what was this built to do?)
rpm -qa --qf '%{NAME}\n' 2>/dev/null | sort | head -50
dpkg -l --no-pager 2>/dev/null | awk '/^ii/{print $2}' | head -50

# What cron jobs exist?
for user in $(cut -f1 -d: /etc/passwd); do
  crontab -l -u "$user" 2>/dev/null | grep -v '^#' | grep -v '^$' && echo "  ^^^ ($user)"
done
ls -la /etc/cron.d/ /etc/cron.daily/ /etc/cron.hourly/ /etc/cron.weekly/ 2>/dev/null

# What's in /etc that's been customized?
find /etc -mtime -365 -type f 2>/dev/null | head -50

# What's using the most disk space?
du -sh /* 2>/dev/null | sort -rh | head -10

# What users exist beyond system accounts?
awk -F: '$3 >= 1000 && $3 != 65534 {print $1, $3, $7}' /etc/passwd   # 65534 = nobody

3. Reading Configs You Didn't Write

The Config Reading Protocol:

  Step 1: Find the REAL config
    - The config in git may not match production
    - The config in /etc may not be the one the process actually reads
    - Check the process command line for config file paths:
      ps aux | grep <process>
      cat /proc/<PID>/cmdline | tr '\0' ' '

  Step 2: Identify what was customized vs. default
    - Compare against package defaults:
      rpm -qf /etc/nginx/nginx.conf      # What package owns this file?
      rpm -V nginx                         # What changed from defaults?
      dpkg -V nginx                        # Debian equivalent
    - Or diff against a fresh install of the same version

  Step 3: Read for intent, not just syntax
    - Comments are gold (read every one)
    - Uncommented-but-present settings tell you what was tried
    - Look for "TODO" and "HACK" and "FIXME" comments
    - Date-stamped comments reveal the history of changes

  Step 4: Map config interdependencies
    - Config file A references file B (includes, imports)
    - Environment variables referenced in configs
    - Templates vs. rendered configs (is this file generated?)

# Find config-like files the process still holds open (many daemons close
# their config after parsing it, so this may come up empty)
ls -la /proc/"$(pgrep -o nginx)"/fd 2>/dev/null | grep -E '\.(conf|cfg|ini|yml|yaml|json)'

# Watch which config files the process opens on a reload (send it SIGHUP while tracing)
strace -f -e openat -p "$(pgrep -o nginx)" 2>&1 | head -50
# Or trace a config check, which opens the same files without a restart:
strace -f -e openat nginx -t 2>&1 | grep -v ENOENT

# Find all config files that reference a specific value
grep -r "database_host\|DB_HOST\|PGHOST" /etc/ /opt/ 2>/dev/null

# Compare running config against on-disk config (for services that support it)
nginx -T 2>/dev/null | head -50     # Running nginx config
postconf -n 2>/dev/null              # Running Postfix config (non-defaults only)
sshd -T 2>/dev/null                  # Running SSH config

4. Tracing Dependencies Without Documentation

Dependency Discovery Methods:

  Method 1: Network connections
    ss -tnp → shows every TCP connection with process names
    → Your service connects to: database, cache, queue, other services
    → Map these connections to hostnames/IPs

  Method 2: DNS lookups
    tcpdump -i any -n port 53   → shows all DNS queries
    → What hostnames does the system resolve?
    → These are its dependencies

  Method 3: Filesystem reads
    lsof -p <PID>               → shows all open files
    → Config files, log files, data files, sockets, libraries
    → Each one is a dependency or artifact

  Method 4: Environment variables
    cat /proc/<PID>/environ | tr '\0' '\n'
    → Database URLs, API endpoints, feature flags
    → These reveal the integration points

  Method 5: systemd unit files
    systemctl cat <service>     → shows the unit file
    → After=, Requires=, Wants= reveal ordered dependencies
    → ExecStartPre= reveals setup steps
    → Environment= and EnvironmentFile= reveal config sources

# Build a dependency map from established TCP connections
ss -tnp | awk 'NR>1 {print $4, $5}' | sort -u | while read -r local remote; do
  # Strip the port (the last :-separated field) so IPv6 addresses survive
  remote_ip=${remote%:*}
  remote_port=${remote##*:}
  hostname=$(getent hosts "$remote_ip" 2>/dev/null | awk '{print $2}')
  echo "$local -> ${hostname:-$remote_ip}:$remote_port"
done

# Map systemd dependencies
systemd-analyze dot --to-pattern='*.service' 2>/dev/null | head -50

# Check what environment variables the service sees
systemctl show <service> -p Environment -p EnvironmentFiles

# Find upstream dependencies from application config
grep -rh "host\|url\|endpoint\|server\|addr" /etc/myapp/ 2>/dev/null | grep -v '^#'

5. Understanding Cron Jobs and Timers

Cron Archaeology:

  Cron jobs are the tribal knowledge of a system.
  Nobody documents them. Nobody remembers adding them.
  They run silently until they break — then they're everyone's emergency.

# Inventory ALL scheduled tasks
echo "=== System crontabs ==="
cat /etc/crontab 2>/dev/null

echo "=== Cron directories ==="
for dir in /etc/cron.d /etc/cron.daily /etc/cron.hourly /etc/cron.weekly /etc/cron.monthly; do
  echo "--- $dir ---"
  ls -la "$dir" 2>/dev/null
done

echo "=== User crontabs ==="
for user in $(cut -f1 -d: /etc/passwd); do
  crons=$(crontab -l -u "$user" 2>/dev/null | grep -v '^#' | grep -v '^$')
  if [ -n "$crons" ]; then
    echo "--- $user ---"
    echo "$crons"
  fi
done

echo "=== Systemd timers ==="
systemctl list-timers --all --no-pager

echo "=== At jobs ==="
atq 2>/dev/null

For each cron job you find, answer:

Cron Job Analysis Template:

  Schedule:    [when does it run?]
  Command:     [what does it execute?]
  User:        [who does it run as?]
  Output:      [where does output go? /dev/null? Email? Log file?]
  Purpose:     [what problem does this solve?]
  Dependencies: [what does it need to work? Network? Database? Disk space?]
  Failure mode: [what happens if it fails? Silent? Alert? Data corruption?]
  Last run:    [is it currently working?]
  Owner:       [who added this? (check git blame, timestamps, comments)]
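The first three fields of the template can be filled in mechanically for /etc/cron.d-style entries (five schedule fields, a user field, then the command). The helper below is a sketch with a made-up name, and it assumes fields are separated by single spaces:

```shell
# Hypothetical helper: split an /etc/cron.d-style line into the template's
# first three answers. Assumes single-space field separators.
describe_cron_entry() {
  line=$1
  printf 'Schedule: %s\n' "$(echo "$line" | awk '{print $1, $2, $3, $4, $5}')"
  printf 'User:     %s\n' "$(echo "$line" | awk '{print $6}')"
  printf 'Command:  %s\n' "$(echo "$line" | cut -d' ' -f7-)"
}

describe_cron_entry "0 3 * * * root /usr/local/bin/backup.sh >/dev/null 2>&1"
# Prints:
#   Schedule: 0 3 * * *
#   User:     root
#   Command:  /usr/local/bin/backup.sh >/dev/null 2>&1
```

The remaining fields (purpose, failure mode, owner) still require human archaeology: check git blame, mail logs, and file timestamps.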

6. Finding the "Real" Config vs. What's in Git

The Config Drift Problem:

  Git says:       max_connections = 100
  Server says:    max_connections = 500
  Documentation:  max_connections = 200

  Which is real? The server is real. Always.
  Git is what someone intended.
  The server is what's actually running.

# Compare deployed config to git
# Step 1: Find what's on the server
cat /etc/myapp/config.yml

# Step 2: Find what's in git
git -C /path/to/repo show HEAD:config/config.yml

# Step 3: Diff them
diff /etc/myapp/config.yml <(git -C /path/to/repo show HEAD:config/config.yml)

# Common reasons for drift:
# 1. Manual hotfix applied during an incident, never committed
# 2. Config management tool (Ansible/Puppet) running a different branch
# 3. Environment variable overrides not reflected in file
# 4. Config file generated by a template with stale variables
# 5. Multiple config files in different locations, wrong one in git

7. Identifying Tribal Knowledge

Signs of Tribal Knowledge:

  1. The README says "ask Dave about the deployment process"
     → Dave left 2 years ago

  2. A script references /home/jsmith/scripts/deploy.sh
     → jsmith's home directory is the canonical source

  3. A cron job runs a binary with no source code in the repo
     → Compiled from somewhere, nobody knows where

  4. The monitoring dashboard was built by someone who left
     → Nobody knows what the queries mean or why the thresholds are set

  5. "We always restart it on the first Monday of the month"
     → Nobody knows why, but bad things happen if you don't

  Discovery technique:
    Grep for names in configs, comments, and git logs:
    git log --all --format='%an' | sort | uniq -c | sort -rn
    → Who wrote the most code? Are they still here?
    → If not, their commits are your primary documentation
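A concrete follow-up once you know who left: grep for their traces in configs and scripts. Home-directory paths hard-coded into configs are tribal-knowledge landmines. The username jsmith and the sample config lines below are made up for illustration:

```shell
# Hypothetical example: find references to a departed user's home directory.
# "jsmith" and the inline sample config are made up for this demo.
departed=jsmith
printf 'deploy=/home/jsmith/scripts/deploy.sh\nlisten_port=8080\n' \
  | grep "/home/$departed"
# Prints: deploy=/home/jsmith/scripts/deploy.sh

# On a real system, search the usual haunts instead:
#   grep -rl "/home/$departed" /etc /opt /var/spool/cron 2>/dev/null
```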

8. Building a Mental Model

The System Map:

  Draw this as you discover it:

  ┌───────────────────┐     ┌──────────────────┐
  │ Load Balancer     │────▶│ Web Server (x3)  │
  │ (nginx, port 443) │     │ (gunicorn :8000) │
  └───────────────────┘     └────────┬─────────┘
                            ┌────────▼─────────┐
                            │ App Server       │
                            │ (Python/Flask)   │
                            └───┬────┬─────┬───┘
                                │    │     │
                      ┌─────────┘    │     └──────┐
                      ▼              ▼            ▼
                ┌────────────┐ ┌───────────┐ ┌──────────┐
                │ PostgreSQL │ │ Redis     │ │ S3       │
                │ (primary)  │ │ (cache)   │ │ (assets) │
                └────────────┘ └───────────┘ └──────────┘

  For each component, note:
  - What host(s) it runs on
  - What port(s) it listens on
  - What it depends on
  - What depends on it
  - Where the config lives
  - Where the logs go
  - How it's deployed
  - Who owns it (if anyone)
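The per-component notes work best when every component answers the same questions, so stamp out a blank fact sheet for each one as you map it. `component_sheet` is a made-up helper name for this sketch:

```shell
# Hypothetical helper: print a blank fact sheet for one component,
# mirroring the checklist above, ready to fill in as you dig.
component_sheet() {
  printf '== %s ==\n' "$1"
  printf '%s\n' \
    'Hosts:' 'Ports:' 'Depends on:' 'Depended on by:' \
    'Config lives at:' 'Logs go to:' 'Deployed by:' 'Owned by:'
}

component_sheet "PostgreSQL (primary)"
# Prints the header line followed by eight blank fields
```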

Common Pitfalls

  1. Changing things before understanding them — The urge to "clean up" legacy code on day one is strong and dangerous. Observe for at least two weeks before making non-emergency changes.
  2. Trusting the documentation — Documentation describes intent, not reality. Always verify against the running system. diff the config on disk against what's in git.
  3. Assuming it's badly designed — What looks like a hack may be a brilliant workaround for a constraint you don't know about. Assume competence until proven otherwise.
  4. Ignoring cron jobs — Cron jobs are the dark matter of infrastructure. They hold critical processes together and nobody documents them. Inventory them on day one.
  5. Not documenting what you find — You're doing archaeology. If you don't write it down, the next person will have to rediscover everything you learned. Be the archaeologist who publishes.
  6. Deleting things that "aren't used" — If you don't understand what something does, you can't know it's unused. Disable before deleting. Wait a month. Then delete.
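Pitfall 6's "disable before deleting" can be made auditable by commenting the entry out with a dated note rather than removing the line. This is a sketch: `disable_cron_entry` is a made-up helper, and the in-place `sed -i.bak` invocation assumes GNU sed.

```shell
# Hypothetical sketch: comment out a matching cron entry with a dated note
# and keep a .bak copy, instead of deleting it outright. Assumes GNU sed.
disable_cron_entry() {
  file=$1 pattern=$2
  sed -i.bak "s|^\([^#].*${pattern}.*\)|# DISABLED $(date +%F) by ${USER:-unknown}, delete after 30 days: \1|" "$file"
}

# Demo against a throwaway file:
tmp=$(mktemp)
echo '0 4 * * 0 root /opt/scripts/mystery-cleanup.sh' > "$tmp"
disable_cron_entry "$tmp" 'mystery-cleanup'
cat "$tmp"    # the entry is now commented out, with a date and a deadline
rm -f "$tmp" "$tmp.bak"
```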

Wiki Navigation

Prerequisites

  • Legacy Systems Flashcards (CLI) (flashcard_deck, L1) — Legacy System Archaeology