Legacy System Archaeology - Street-Level Ops¶
What experienced ops engineers know about inheriting systems nobody understands and making sense of them before something breaks.
Quick Diagnosis Commands¶
One-liner:
lsof -i -P -n | awk '{print $1, $9}' | sort -u -- instant map of every process talking to the network.
# The "I just got access to a server I know nothing about" starter pack
# Identity and age
hostname && cat /etc/os-release | head -5
uptime && who -b
uname -r # Kernel version (old = risky)
# What does this server DO?
systemctl list-units --type=service --state=running --no-pager | grep -v systemd
ss -tlnp # What's listening?
ss -tnp | awk 'NR>1 {print $5}' | cut -d: -f1 | sort -u # What does it talk to? (NR>1 skips the header)
# What's custom here?
rpm -qa --last 2>/dev/null | head -20 # Recent installs (RHEL)
find /opt /srv -maxdepth 2 -type f \( -name '*.conf' -o -name '*.yml' \) 2>/dev/null
# Who has been here?
last -20 # Recent logins
ls -lt /home/ 2>/dev/null # Active user directories
find /root -maxdepth 1 -type f -mtime -90 2>/dev/null # Root's recent files
# What runs on a schedule?
crontab -l 2>/dev/null
ls /etc/cron.d/ /etc/cron.daily/ 2>/dev/null
systemctl list-timers --no-pager
# Where are the logs?
ls -lh /var/log/*.log /var/log/*/*.log 2>/dev/null | sort -k5 -rh | head -10
journalctl --disk-usage
# What's eating disk space?
df -h && du -sh /var/log /opt /srv /tmp 2>/dev/null
Gotcha: The Config That Isn't¶
You find /etc/myapp/config.yml. You read it carefully. You plan changes based on it. You make the change. Nothing happens. The process doesn't read that file — it reads /opt/myapp/conf/app.conf because someone passed a --config flag in the systemd unit file.
Fix: Always verify which config file the process actually uses:
# Check the process command line
cat /proc/$(pgrep -o myapp)/cmdline | tr '\0' ' ' # -o: oldest PID, in case several match
# Output: /opt/myapp/bin/myapp --config /opt/myapp/conf/app.conf
# Check the systemd unit file
systemctl cat myapp
# Look for: ExecStart= line with --config, -c, or -f flags
# Check for environment variables overriding config paths
cat /proc/$(pgrep -o myapp)/environ | tr '\0' '\n' | grep -i config
# Check what files the process has open
lsof -p $(pgrep -d, myapp) | grep -E '\.(conf|cfg|yml|yaml|json|ini|properties)' # -d, joins multiple PIDs for lsof
Gotcha: The Cron Job That Holds Everything Together¶
You're cleaning up cron jobs. One of them runs curl http://localhost:8080/internal/cleanup every hour. No documentation. No comments. You disable it to "see what happens." A week later, the database table has 50 million rows, the app is slow, and disk is filling up. The cron job was the only thing preventing unbounded data growth.
Fix: Before disabling any cron job, investigate what it does:
# Read the script or command
cat /etc/cron.d/mysterious-job
# If it calls a script, read the script
cat /opt/scripts/cleanup.sh
# If it calls a URL, find out what that endpoint does -- note that
# requesting it actually triggers it, so read the app code or docs first
curl -v http://localhost:8080/internal/cleanup 2>&1 | head -20
# Check when it last ran successfully
# Look for output in /var/mail, /var/log/cron, or syslog
grep "mysterious-job\|cleanup" /var/log/cron 2>/dev/null | tail -5
journalctl -u cron -u crond --since "24 hours ago" | grep cleanup # unit name: cron (Debian) or crond (RHEL)
# Disable by commenting, not deleting. Add a note.
# #DISABLED 2026-03-15 investigating purpose - @eric
# 0 * * * * curl http://localhost:8080/internal/cleanup
Gotcha: The User Account That's Actually a Service¶
You find a user account "jsmith" with a home directory full of scripts, a crontab with 15 entries, and SSH keys that other servers depend on. You disable the account because jsmith left the company two years ago. Half the infrastructure breaks because deployments, backups, and monitoring all ran as jsmith.
Fix: Before disabling any user account:
# Check if the user has a crontab
crontab -l -u jsmith 2>/dev/null
# Check if any services run as this user
ps aux | grep '[j]smith' # bracket trick keeps grep from matching itself
systemctl list-units --type=service --no-legend | awk '{print $1}' | while read -r svc; do systemctl show "$svc" -p User | grep -q jsmith && echo "$svc runs as jsmith"; done
# Check if any SSH authorized_keys reference this user
grep -r "jsmith" /home/*/.ssh/authorized_keys 2>/dev/null
# known_hosts lists hosts this account has SSHed OUT to
cat /home/jsmith/.ssh/known_hosts 2>/dev/null | wc -l
# Inbound trust lives on OTHER servers: search their authorized_keys and
# auth logs for jsmith's key before assuming nothing connects in as jsmith
# Check for files owned by this user outside home
find /opt /srv /etc -user jsmith 2>/dev/null | head -20
# Proper procedure: create a service account, migrate everything,
# then disable the personal account
Gotcha: Git Says One Thing, Production Says Another¶
The git repo has a clean, well-documented Ansible setup. You assume production matches. It doesn't. Someone applied a hotfix directly to production during an incident six months ago and never committed it. Your next Ansible run will revert the hotfix and re-create the incident.
Fix: Before running any config management against an inherited system:
# Diff the actual state against what config management thinks
# For Ansible: run in check mode with diff
ansible-playbook site.yml --check --diff --limit target-server
# For files managed by packages, check for local modifications
rpm -Va 2>/dev/null | grep '^..5' | head -20 # RHEL: "5" in column 3 = file contents differ from the package
dpkg -V 2>/dev/null | head -20 # Debian
# For specific critical configs
diff /etc/nginx/nginx.conf <(git -C /path/to/repo show HEAD:nginx/nginx.conf)
# Tag the current state before making any changes
# Create a backup of every config you plan to touch
mkdir -p /root/config-backup-$(date +%Y%m%d)
cp -a /etc/nginx /root/config-backup-$(date +%Y%m%d)/
cp -a /etc/myapp /root/config-backup-$(date +%Y%m%d)/
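Flat copies answer "what was there"; a git snapshot also answers "what did I change since". A minimal sketch, assuming git is installed -- the /tmp path is illustrative (use /root on a real box), and /etc/hostname stands in for whichever configs you plan to touch:

```shell
# Put the pre-change state under git so every later edit is diffable
BACKUP="/tmp/config-backup-$(date +%Y%m%d)"
mkdir -p "$BACKUP"
cp -a /etc/hostname "$BACKUP"/ 2>/dev/null || true  # add each config you plan to touch
git -C "$BACKUP" init -q
git -C "$BACKUP" add -A
git -C "$BACKUP" -c user.email=ops@localhost -c user.name=ops \
    commit -qm "pre-change state" --allow-empty
```

After editing a config, copy it into the backup again: git -C "$BACKUP" diff then shows exactly what you changed.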
Pattern: The 30-Day Observation Period¶
War story: An engineer inherited a "legacy" server and immediately started "cleaning up." He disabled an ancient Perl CGI script running on port 8888. Turns out the billing department had been using it daily for 11 years to generate invoices. Nobody in engineering knew it existed.
Week 1: Observe and inventory
- Run the first-day survey
- Inventory all services, cron jobs, and users
- Map network connections and dependencies
- Read all config files and their comments
- DO NOT CHANGE ANYTHING
Week 2: Understand and document
- Draw the system architecture from what you've observed
- Document each component's purpose, config, and dependencies
- Identify discrepancies between documentation and reality
- Talk to people who interact with the system (support, product, other teams)
Week 3: Identify risks and quick wins
- Flag: what will break next? (disk filling, cert expiring, EOL software)
- Flag: what's missing? (monitoring, backups, documentation)
- Identify low-risk improvements that build confidence
Week 4: Start small changes
- Fix one monitoring gap
- Update one stale runbook
- Add one missing alert
- Make one small, reversible improvement
- Document everything you do
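The Week 1 survey is worth writing to disk so later weeks can diff against it. A sketch of a dated snapshot; the /tmp path is a placeholder, and commands that may be absent are guarded:

```shell
# Dated inventory snapshot -- re-run weekly and diff to spot drift
SNAP="/tmp/inventory/$(date +%Y%m%d)"
mkdir -p "$SNAP"
command -v systemctl >/dev/null && \
    systemctl list-units --type=service --state=running --no-pager > "$SNAP/services.txt"
ss -tlnp > "$SNAP/listeners.txt" 2>/dev/null || true
crontab -l > "$SNAP/root-crontab.txt" 2>/dev/null || true
getent passwd | awk -F: '$3 >= 1000 {print $1}' > "$SNAP/users.txt"
df -h > "$SNAP/disk.txt"
echo "snapshot written to $SNAP"
```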
Pattern: Reverse-Engineering Monitoring¶
The monitoring tells you what the previous team thought was important.
Step 1: Inventory existing monitoring
- What dashboards exist? What do they show?
- What alerts are configured? What thresholds?
- What's monitored that probably shouldn't be? (noise)
- What's NOT monitored that should be? (gaps)
Step 2: Read the alert history
- Which alerts fire most often? (the real problems)
- Which alerts are always silenced? (noise or accepted risk)
- Which alerts were added after incidents? (the lessons learned)
Step 3: Infer the system model
- If they monitor database connection count, connections were a problem
- If they monitor disk on /var/log, log growth was a problem
- If they have a "manual restart needed" alert, auto-restart doesn't work
- The monitoring gaps reveal what they never had problems with
(or what they never caught)
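Alert-history exports vary by tool, but once flattened to one alert name per line, the ranking in Step 2 is a one-liner. A sketch on made-up data (the file name and alert names are hypothetical):

```shell
# Hypothetical export: one line per alert firing
printf 'DiskFull\nDiskFull\nHighLatency\nDiskFull\nHighLatency\nCertExpiry\n' > /tmp/alerts.log
# Frequency ranking: the top entries are the system's chronic problems
sort /tmp/alerts.log | uniq -c | sort -rn
```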
Pattern: Building the Dependency Graph¶
# Automated dependency discovery script
#!/bin/bash
echo "=== Services ==="
systemctl list-units --type=service --state=running --no-pager | grep -v systemd
echo ""
echo "=== Network Listeners ==="
ss -tlnp | grep -v "State"
echo ""
echo "=== Outbound Connections ==="
ss -tnp | awk 'NR>1 {print $5}' | cut -d: -f1 | sort -u | while read ip; do
host=$(getent hosts "$ip" 2>/dev/null | awk '{print $2}')
ports=$(ss -tnp | grep "$ip" | awk '{print $5}' | cut -d: -f2 | sort -u | tr '\n' ',')
echo " → ${host:-$ip} ports: ${ports%,}"
done
echo ""
echo "=== Systemd Dependencies ==="
for svc in $(systemctl list-units --type=service --state=running --no-legend --no-pager | awk '{print $1}'); do
deps=$(systemctl show "$svc" -p After --no-pager 2>/dev/null | grep -oP '[\w-]+\.service' | grep -v "systemd\|basic\|sysinit" | tr '\n' ',')
if [ -n "$deps" ]; then
echo " $svc depends on: ${deps%,}"
fi
done
echo ""
echo "=== Mount Points ==="
findmnt -t nfs,nfs4,cifs,ext4,xfs --noheadings 2>/dev/null
echo ""
echo "=== DNS Resolver ==="
grep -v '^#' /etc/resolv.conf
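When the textual dependency list gets unwieldy, the same systemd data can be emitted as Graphviz DOT and rendered with dot -Tpng /tmp/deps.dot -o deps.png. A sketch -- it assumes GNU grep, and degrades to an empty graph where systemctl is unavailable:

```shell
# Emit running-service After= dependencies as a DOT graph
{
  echo 'digraph deps {'
  for svc in $(systemctl list-units --type=service --state=running --no-legend 2>/dev/null | awk '{print $1}'); do
    systemctl show "$svc" -p After --no-pager 2>/dev/null \
      | grep -oE '[[:alnum:]@_.-]+\.service' \
      | while read -r dep; do printf '  "%s" -> "%s";\n' "$svc" "$dep"; done
  done
  echo '}'
} > /tmp/deps.dot
```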
Emergency: You Broke Something on a System You Don't Understand¶
You made a change. Something broke. You're not sure what the previous state was. Panic is setting in.
1. Stop making changes. Do not try to "fix forward."
2. Check if you have a backup:
- Did you copy the config before changing it?
- Is there a config management system that has the previous state?
- Does the package manager have the original? (rpm -qf /etc/myapp.conf finds the owning package, which you can reinstall or extract)
3. Check revision control:
- etckeeper: cd /etc && git log --oneline -5
- Ansible last run: check the Ansible control node's logs
- Config management: puppet agent --test --noop
4. Check the system journal for the previous state:
journalctl -u myservice --since "1 hour ago"
# Error messages often contain the config values that failed
5. If all else fails, check other servers:
- Is there another server running the same service?
- Compare its config to yours
- The diff shows what you changed
6. After recovery, document:
- What was the previous state?
- What did you change?
- What broke?
- How did you recover?
- What should you have done instead?
Debug clue: If etckeeper is installed (dpkg -l etckeeper or rpm -q etckeeper), you have a full git history of every /etc change ever made. Run cd /etc && git log --oneline -20 -- this is the single most valuable archaeology tool on any inherited server.
Emergency: Production Server Nobody Has Credentials For¶
You've inherited a server. The previous admin left. Root password is unknown. SSH keys are from departed employees. You need to get in.
1. Check if config management has access:
- Ansible: does the inventory include this host?
ansible <hostname> -m ping
- Puppet/Chef: does the agent still check in?
2. Check if another account works:
- Try known service accounts
- Try LDAP/SSO credentials if the server is domain-joined
- Check password vault (1Password, Vault, CyberArk)
3. If you have physical/console access (datacenter/IPMI):
- Boot with init=/bin/bash added to the kernel command line in GRUB
- Remount root read-write (mount -o remount,rw /), then reset the password: passwd root
- Reboot normally
4. If it's a VM:
- Mount the disk on another VM
- Edit /etc/shadow to clear the root password
- Or add your SSH key to /root/.ssh/authorized_keys
- Remount and boot
5. If it's a cloud instance:
- Use the cloud provider's serial console
- Create an AMI/snapshot, launch new instance, mount disk
6. After access is restored:
- Change all passwords and rotate all keys
- Audit who has access
- Set up proper credential management going forward
- Document the access recovery procedure for next time
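For the access audit in step 6, two read-only checks give a first pass, sketched below. Standard paths are assumed, and LDAP/SSO accounts won't appear in /etc/passwd:

```shell
# Who can log in with a shell?
echo "== accounts with a login shell =="
awk -F: '$7 !~ /(nologin|false)$/ {print $1, $7}' /etc/passwd
# Which SSH keys still grant access?
echo "== authorized SSH keys =="
for f in /root/.ssh/authorized_keys /home/*/.ssh/authorized_keys; do
  [ -f "$f" ] || continue
  echo "$f: $(wc -l < "$f") key(s)"
done
```

Any key line you can't attribute to a current employee or a documented service account is a candidate for rotation.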