Portal | Level: L1: Foundations | Topics: Legacy System Archaeology | Domain: DevOps & Tooling

Legacy System Archaeology - Primer

Why This Matters

Every ops engineer's first day at a new job starts the same way: you're handed a system nobody fully understands, with documentation that's two years stale, and told "don't break anything." Legacy system archaeology is the skill of building a mental model of an inherited system — without the benefit of the people who built it, the original design docs, or accurate architecture diagrams.

This is not a niche skill. It's the first thing you need at every new job, every acquisition, every team transfer, and every on-call rotation for a service you didn't build. The ability to reverse-engineer an existing system from its artifacts — configs, cron jobs, logs, network connections, and running processes — is what separates someone who's dangerous from someone who's useful.

Core Concepts

1. The Archaeological Approach

Archaeology Mindset:

  You are not here to judge. You are here to understand.

  The previous team was not stupid. They were:
  - Working with different constraints
  - Solving different problems
  - Under different time pressures
  - Using the tools available at the time

  Your job:
  1. Observe what exists (don't change anything yet)
  2. Understand why it exists (context, not just configuration)
  3. Document what you find (for the next archaeologist)
  4. Then — and only then — consider changes

2. The First Day Survey

When you inherit a system, run this survey before touching anything:

# Who am I, where am I, what is this?
hostname && uname -a && cat /etc/os-release

# How long has this system been running?
uptime
who -b                                # Last boot time

# What's the system's purpose? (infer from running services)
systemctl list-units --type=service --state=running | grep -v systemd

# What's listening on the network?
ss -tlnp                              # TCP listeners with process names
ss -ulnp                              # UDP listeners

# What's talking to what?
ss -tnp | awk 'NR>1 {print $5}' | sort | uniq -c | sort -rn | head -20

# What packages are installed? (what was this built to do?)
rpm -qa --qf '%{NAME}\n' 2>/dev/null | sort | head -50
dpkg -l --no-pager 2>/dev/null | awk '/^ii/{print $2}' | head -50

# What cron jobs exist?
for user in $(cut -f1 -d: /etc/passwd); do
  crontab -l -u "$user" 2>/dev/null | grep -v '^#' | grep -v '^$' && echo "  ^^^ ($user)"
done
ls -la /etc/cron.d/ /etc/cron.daily/ /etc/cron.hourly/ /etc/cron.weekly/ 2>/dev/null

# What's in /etc that's been customized?
find /etc -mtime -365 -type f 2>/dev/null | head -50

# What's using the most disk space?
du -sh /* 2>/dev/null | sort -rh | head -10

# What users exist beyond system accounts?
awk -F: '$3 >= 1000 && $3 != 65534 {print $1, $3, $7}' /etc/passwd   # 65534 = nobody

3. Reading Configs You Didn't Write

The Config Reading Protocol:

  Step 1: Find the REAL config
    - The config in git may not match production
    - The config in /etc may not be the one the process actually reads
    - Check the process command line for config file paths:
      ps aux | grep <process>
      cat /proc/<PID>/cmdline | tr '\0' ' '

  Step 2: Identify what was customized vs. default
    - Compare against package defaults:
      rpm -qf /etc/nginx/nginx.conf      # What package owns this file?
      rpm -V nginx                         # What changed from defaults?
      dpkg -V nginx                        # Debian equivalent
    - Or diff against a fresh install of the same version

  Step 3: Read for intent, not just syntax
    - Comments are gold (read every one)
    - Uncommented-but-present settings tell you what was tried
    - Look for "TODO" and "HACK" and "FIXME" comments
    - Date-stamped comments reveal the history of changes

  Step 4: Map config interdependencies
    - Config file A references file B (includes, imports)
    - Environment variables referenced in configs
    - Templates vs. rendered configs (is this file generated?)

# Find config-like files the process still holds open (many daemons close
# their config after parsing it, so this may come up empty)
ls -la /proc/"$(pgrep -o nginx)"/fd 2>/dev/null | grep -E '\.(conf|cfg|ini|yml|yaml|json)'

# Watch which config files the process opens on a reload (send it SIGHUP while tracing)
strace -f -e openat -p "$(pgrep -o nginx)" 2>&1 | head -50
# Or trace a config check, which opens the same files without a restart:
strace -f -e openat nginx -t 2>&1 | grep -v ENOENT

# Find all config files that reference a specific value
grep -r "database_host\|DB_HOST\|PGHOST" /etc/ /opt/ 2>/dev/null

# Compare running config against on-disk config (for services that support it)
nginx -T 2>/dev/null | head -50     # Running nginx config
postconf -n 2>/dev/null              # Running Postfix config (non-defaults only)
sshd -T 2>/dev/null                  # Running SSH config

4. Tracing Dependencies Without Documentation

Dependency Discovery Methods:

  Method 1: Network connections
    ss -tnp → shows every TCP connection with process names
    → Your service connects to: database, cache, queue, other services
    → Map these connections to hostnames/IPs

  Method 2: DNS lookups
    tcpdump -i any -n port 53   → shows all DNS queries
    → What hostnames does the system resolve?
    → These are its dependencies

  Method 3: Filesystem reads
    lsof -p <PID>               → shows all open files
    → Config files, log files, data files, sockets, libraries
    → Each one is a dependency or artifact

  Method 4: Environment variables
    cat /proc/<PID>/environ | tr '\0' '\n'
    → Database URLs, API endpoints, feature flags
    → These reveal the integration points

  Method 5: systemd unit files
    systemctl cat <service>     → shows the unit file
    → After=, Requires=, Wants= reveal ordered dependencies
    → ExecStartPre= reveals setup steps
    → Environment= and EnvironmentFile= reveal config sources

# Build a dependency map from established TCP connections
ss -tnp | awk 'NR>1 {print $4, $5}' | sort -u | while read -r local remote; do
  # Strip the port (the last :-separated field) so IPv6 addresses survive
  remote_ip=${remote%:*}
  remote_port=${remote##*:}
  hostname=$(getent hosts "$remote_ip" 2>/dev/null | awk '{print $2}')
  echo "$local -> ${hostname:-$remote_ip}:$remote_port"
done

# Map systemd dependencies
systemd-analyze dot --to-pattern='*.service' 2>/dev/null | head -50

# Check what environment variables the service sees
systemctl show <service> -p Environment -p EnvironmentFiles

# Find upstream dependencies from application config
grep -rh "host\|url\|endpoint\|server\|addr" /etc/myapp/ 2>/dev/null | grep -v '^#'

5. Understanding Cron Jobs and Timers

Cron Archaeology:

  Cron jobs are the tribal knowledge of a system.
  Nobody documents them. Nobody remembers adding them.
  They run silently until they break — then they're everyone's emergency.

# Inventory ALL scheduled tasks
echo "=== System crontabs ==="
cat /etc/crontab 2>/dev/null

echo "=== Cron directories ==="
for dir in /etc/cron.d /etc/cron.daily /etc/cron.hourly /etc/cron.weekly /etc/cron.monthly; do
  echo "--- $dir ---"
  ls -la "$dir" 2>/dev/null
done

echo "=== User crontabs ==="
for user in $(cut -f1 -d: /etc/passwd); do
  crons=$(crontab -l -u "$user" 2>/dev/null | grep -v '^#' | grep -v '^$')
  if [ -n "$crons" ]; then
    echo "--- $user ---"
    echo "$crons"
  fi
done

echo "=== Systemd timers ==="
systemctl list-timers --all --no-pager

echo "=== At jobs ==="
atq 2>/dev/null

For each cron job you find, answer:

Cron Job Analysis Template:

  Schedule:    [when does it run?]
  Command:     [what does it execute?]
  User:        [who does it run as?]
  Output:      [where does output go? /dev/null? Email? Log file?]
  Purpose:     [what problem does this solve?]
  Dependencies: [what does it need to work? Network? Database? Disk space?]
  Failure mode: [what happens if it fails? Silent? Alert? Data corruption?]
  Last run:    [is it currently working?]
  Owner:       [who added this? (check git blame, timestamps, comments)]
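The first three fields of the template can be filled in mechanically for /etc/cron.d-style entries (five schedule fields, a user field, then the command). The helper below is a sketch with a made-up name, and it assumes fields are separated by single spaces:

```shell
# Hypothetical helper: split an /etc/cron.d-style line into the template's
# first three answers. Assumes single-space field separators.
describe_cron_entry() {
  line=$1
  printf 'Schedule: %s\n' "$(echo "$line" | awk '{print $1, $2, $3, $4, $5}')"
  printf 'User:     %s\n' "$(echo "$line" | awk '{print $6}')"
  printf 'Command:  %s\n' "$(echo "$line" | cut -d' ' -f7-)"
}

describe_cron_entry "0 3 * * * root /usr/local/bin/backup.sh >/dev/null 2>&1"
# Prints:
#   Schedule: 0 3 * * *
#   User:     root
#   Command:  /usr/local/bin/backup.sh >/dev/null 2>&1
```

The remaining fields (purpose, failure mode, owner) still require human archaeology: check git blame, mail logs, and file timestamps.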

6. Finding the "Real" Config vs. What's in Git

The Config Drift Problem:

  Git says:       max_connections = 100
  Server says:    max_connections = 500
  Documentation:  max_connections = 200

  Which is real? The server is real. Always.
  Git is what someone intended.
  The server is what's actually running.

# Compare deployed config to git
# Step 1: Find what's on the server
cat /etc/myapp/config.yml

# Step 2: Find what's in git
git -C /path/to/repo show HEAD:config/config.yml

# Step 3: Diff them
diff /etc/myapp/config.yml <(git -C /path/to/repo show HEAD:config/config.yml)

# Common reasons for drift:
# 1. Manual hotfix applied during an incident, never committed
# 2. Config management tool (Ansible/Puppet) running a different branch
# 3. Environment variable overrides not reflected in file
# 4. Config file generated by a template with stale variables
# 5. Multiple config files in different locations, wrong one in git

7. Identifying Tribal Knowledge

Signs of Tribal Knowledge:

  1. The README says "ask Dave about the deployment process"
     → Dave left 2 years ago

  2. A script references /home/jsmith/scripts/deploy.sh
     → jsmith's home directory is the canonical source

  3. A cron job runs a binary with no source code in the repo
     → Compiled from somewhere, nobody knows where

  4. The monitoring dashboard was built by someone who left
     → Nobody knows what the queries mean or why the thresholds are set

  5. "We always restart it on the first Monday of the month"
     → Nobody knows why, but bad things happen if you don't

  Discovery technique:
    Grep for names in configs, comments, and git logs:
    git log --all --format='%an' | sort | uniq -c | sort -rn
    → Who wrote the most code? Are they still here?
    → If not, their commits are your primary documentation
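A concrete follow-up once you know who left: grep for their traces in configs and scripts. Home-directory paths hard-coded into configs are tribal-knowledge landmines. The username jsmith and the sample config lines below are made up for illustration:

```shell
# Hypothetical example: find references to a departed user's home directory.
# "jsmith" and the inline sample config are made up for this demo.
departed=jsmith
printf 'deploy=/home/jsmith/scripts/deploy.sh\nlisten_port=8080\n' \
  | grep "/home/$departed"
# Prints: deploy=/home/jsmith/scripts/deploy.sh

# On a real system, search the usual haunts instead:
#   grep -rl "/home/$departed" /etc /opt /var/spool/cron 2>/dev/null
```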

8. Building a Mental Model

The System Map:

  Draw this as you discover it:

  ┌───────────────────┐     ┌──────────────────┐
  │ Load Balancer     │────▶│ Web Server (x3)  │
  │ (nginx, port 443) │     │ (gunicorn :8000) │
  └───────────────────┘     └────────┬─────────┘
                            ┌────────▼─────────┐
                            │ App Server       │
                            │ (Python/Flask)   │
                            └───┬────┬─────┬───┘
                                │    │     │
                      ┌─────────┘    │     └──────┐
                      ▼              ▼            ▼
                ┌────────────┐ ┌───────────┐ ┌──────────┐
                │ PostgreSQL │ │ Redis     │ │ S3       │
                │ (primary)  │ │ (cache)   │ │ (assets) │
                └────────────┘ └───────────┘ └──────────┘

  For each component, note:
  - What host(s) it runs on
  - What port(s) it listens on
  - What it depends on
  - What depends on it
  - Where the config lives
  - Where the logs go
  - How it's deployed
  - Who owns it (if anyone)
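The per-component notes work best when every component answers the same questions, so stamp out a blank fact sheet for each one as you map it. `component_sheet` is a made-up helper name for this sketch:

```shell
# Hypothetical helper: print a blank fact sheet for one component,
# mirroring the checklist above, ready to fill in as you dig.
component_sheet() {
  printf '== %s ==\n' "$1"
  printf '%s\n' \
    'Hosts:' 'Ports:' 'Depends on:' 'Depended on by:' \
    'Config lives at:' 'Logs go to:' 'Deployed by:' 'Owned by:'
}

component_sheet "PostgreSQL (primary)"
# Prints the header line followed by eight blank fields
```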

Common Pitfalls

  1. Changing things before understanding them — The urge to "clean up" legacy code on day one is strong and dangerous. Observe for at least two weeks before making non-emergency changes.
  2. Trusting the documentation — Documentation describes intent, not reality. Always verify against the running system. diff the config on disk against what's in git.
  3. Assuming it's badly designed — What looks like a hack may be a brilliant workaround for a constraint you don't know about. Assume competence until proven otherwise.
  4. Ignoring cron jobs — Cron jobs are the dark matter of infrastructure. They hold critical processes together and nobody documents them. Inventory them on day one.
  5. Not documenting what you find — You're doing archaeology. If you don't write it down, the next person will have to rediscover everything you learned. Be the archaeologist who publishes.
  6. Deleting things that "aren't used" — If you don't understand what something does, you can't know it's unused. Disable before deleting. Wait a month. Then delete.
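Pitfall 6's "disable before deleting" can be made auditable by commenting the entry out with a dated note rather than removing the line. This is a sketch: `disable_cron_entry` is a made-up helper, and the in-place `sed -i.bak` invocation assumes GNU sed.

```shell
# Hypothetical sketch: comment out a matching cron entry with a dated note
# and keep a .bak copy, instead of deleting it outright. Assumes GNU sed.
disable_cron_entry() {
  file=$1 pattern=$2
  sed -i.bak "s|^\([^#].*${pattern}.*\)|# DISABLED $(date +%F) by ${USER:-unknown}, delete after 30 days: \1|" "$file"
}

# Demo against a throwaway file:
tmp=$(mktemp)
echo '0 4 * * 0 root /opt/scripts/mystery-cleanup.sh' > "$tmp"
disable_cron_entry "$tmp" 'mystery-cleanup'
cat "$tmp"    # the entry is now commented out, with a date and a deadline
rm -f "$tmp" "$tmp.bak"
```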

Wiki Navigation

Prerequisites

  • Legacy Systems Flashcards (CLI) (flashcard_deck, L1) — Legacy System Archaeology