Legacy System Archaeology Footguns

Mistakes that break inherited systems, destroy undocumented workflows, and turn your first week into your last.


1. "Cleaning up" on day one

You inherit a server. There are 15 cron jobs, a dozen scripts in /opt/scripts, and config files that look like they haven't been touched in years. Your instinct says "clean this up." You disable three cron jobs that "look unused." Tuesday, the database backup stops. Thursday, the log rotation stops. Friday, the disk is full.

Fix: No changes in the first two weeks. Observe only. Inventory everything. For each artifact, document: what it does, when it runs, and what depends on it. Only then decide what to change — and change one thing at a time.

Remember: Apply the "two-week freeze" rule: observe for two weeks before changing anything on an inherited system. Use that time to build an inventory: crontab -l, systemctl list-units --type=service, ss -tlnp, find /opt/scripts -type f. Map dependencies before touching anything.
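A minimal sketch of that inventory pass, as a script you might run on day one. The directory name and command list are illustrative; add whatever your system actually runs. Commands that don't exist on a given box fail silently and leave an empty file, which is itself a data point.

```shell
# Day-one inventory sketch: capture current state into dated files
# before touching anything. INV_DIR is a temp dir here; in practice
# use something like /root/inventory-$(date +%F).
INV_DIR="$(mktemp -d)"

crontab -l >"$INV_DIR/crontab.txt" 2>/dev/null || true
ls /etc/cron.d/ /etc/cron.daily/ >"$INV_DIR/cron-dirs.txt" 2>/dev/null || true
systemctl list-units --type=service --no-pager >"$INV_DIR/services.txt" 2>/dev/null || true
ss -tlnp >"$INV_DIR/listeners.txt" 2>/dev/null || true
find /opt/scripts /usr/local/bin -type f 2>/dev/null | sort >"$INV_DIR/scripts.txt" || true

echo "inventory written to $INV_DIR"
ls "$INV_DIR"
```

Re-run it weekly during the freeze and diff the snapshots: anything that changes on its own is a moving part you haven't mapped yet.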


2. Trusting the architecture diagram

The wiki has a beautiful architecture diagram from 2023. It shows three application servers behind a load balancer talking to one database. Reality: there are five app servers (two were added during a traffic spike and never documented), the load balancer has a second tier for websockets, and the database has a read replica that six services depend on. Your change based on the diagram breaks the undocumented components.

Fix: Verify the architecture diagram against reality using ss -tnp, systemctl, and network scans. Assume every diagram is outdated until proven current. Redraw the diagram from observation, not documentation.
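One way to make "verify against reality" concrete: diff a host list transcribed from the diagram against a host list built from observation, and treat anything that appears only in reality as the dangerous part. The host names below are made up; on a real network the observed list would come from ss -tnp on each host or a network scan.

```shell
# Diagram-vs-reality diff. "documented" is transcribed from the wiki
# diagram; "observed" comes from live discovery. comm needs sorted input.
D="$(mktemp -d)"
printf '%s\n' app1 app2 app3 db1 | sort >"$D/documented.txt"
printf '%s\n' app1 app2 app3 app4 app5 db1 db1-replica | sort >"$D/observed.txt"

# Hosts that exist only in reality: the undocumented components.
comm -13 "$D/documented.txt" "$D/observed.txt"
```

In this fabricated example the diff surfaces app4, app5, and db1-replica, exactly the pieces the 2023 diagram would have let you break.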


3. Disabling a "departed employee" account that's actually a service account

John left the company. You find his user account is still active with 12 cron jobs, SSH keys to 8 servers, and ownership of critical data directories. You lock the account per security policy. Deployments stop. Backups fail. Monitoring goes dark. Everything ran as John.

Fix: Before disabling any account, run a full dependency check: cron jobs, file ownership, running processes, SSH key dependencies, and service unit files. Create a proper service account first, migrate all dependencies, verify, then disable the personal account.
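A rough sketch of that dependency check as a single report, assuming a personal account named "john" (a placeholder). Each probe is best-effort: on hosts where a command needs root or the user doesn't exist, it degrades to a "(none)" line rather than failing.

```shell
# Pre-disable dependency sweep for a personal account. USER_TO_CHECK
# is a placeholder; run as root for complete results.
USER_TO_CHECK=john
REPORT="$(mktemp)"
{
  echo "--- cron jobs ---"
  crontab -l -u "$USER_TO_CHECK" 2>/dev/null || echo "(none or no access)"
  echo "--- running processes ---"
  ps -u "$USER_TO_CHECK" -o pid=,args= 2>/dev/null || echo "(none)"
  echo "--- owned files under /opt and /srv ---"
  find /opt /srv -user "$USER_TO_CHECK" 2>/dev/null | head -20
  echo "--- systemd units running as the user ---"
  grep -rl "^User=$USER_TO_CHECK" /etc/systemd/system 2>/dev/null || echo "(none found)"
} >"$REPORT"
cat "$REPORT"
```

Only when every section of the report is empty, or migrated to a service account, is the lock safe.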


4. Running config management against a server that drifted

You have Ansible playbooks. You run them against the inherited server "to bring it into compliance." The playbooks revert three hotfixes that were applied directly during incidents. The incidents return. You've just undone six months of tribal knowledge encoded as production config drift.

Fix: Always run config management in check/diff mode first: ansible-playbook --check --diff. Review every proposed change. If the production config differs from the playbook, investigate why before assuming the playbook is correct. Production is the source of truth.

Gotcha: Config drift is not always accidental. Hotfixes applied during incidents encode hard-won knowledge: "this sysctl setting prevents the OOM killer from targeting the database," "this iptables rule blocks a specific attack pattern." Reverting drift without understanding it means re-introducing the problems that caused the drift.
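One hedged way to enforce the check-first habit is a small wrapper that refuses to apply a playbook until a dry run has been reviewed. This is a sketch, not ansible's own tooling: the playbook name, the marker-file convention, and the wrapper itself are all made up for illustration.

```shell
# Hypothetical guard: refuse to apply a playbook until its dry run
# has been reviewed (signalled by a marker file). "site.yml" is a
# placeholder playbook name.
dry_run_first() {
  playbook="$1"
  marker=".reviewed-$(basename "$playbook")"
  if [ ! -f "$marker" ]; then
    echo "run: ansible-playbook --check --diff $playbook"
    echo "(review every proposed change, then: touch $marker)"
    return 1
  fi
  ansible-playbook "$playbook"
}

dry_run_first site.yml || echo "dry run pending -- nothing applied"
```

The marker file is deliberately manual: a human has to read the diff and decide, per the Gotcha above, whether each drift is a bug or a hotfix.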


5. Deleting old log files to free space

Disk is at 95%. You find 40GB of logs in /var/log/myapp/. You rm -rf them. The application crashes because the directory it writes to is gone and it doesn't recreate it. Or: the application doesn't crash, but it still holds the deleted file open, so the space stays allocated (lsof shows the files as "deleted") and you freed zero bytes.

Fix: Use log rotation, not deletion. For immediate space recovery: truncate open files instead of deleting them (> /var/log/myapp/current.log). For closed files: verify nothing references them, then delete. Better yet: set up logrotate and solve the problem permanently.
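The truncate-vs-delete distinction is easy to demo safely with a throwaway file: truncation keeps the same inode, so any process holding the file open keeps writing to a valid, now-empty file. Paths here are temp files, not real logs.

```shell
# Why truncation beats deletion: the inode (and any open writer's
# handle) survives. LOG is a throwaway stand-in for a real log file.
LOG_DIR="$(mktemp -d)"
LOG="$LOG_DIR/current.log"

head -c 1048576 /dev/zero >"$LOG"     # simulate a 1 MB log file
wc -c <"$LOG"                          # size before

: >"$LOG"                              # truncate in place, same inode
wc -c <"$LOG"                          # size after: zero bytes, file intact
```

The article's `> /var/log/myapp/current.log` is the same operation; `: >file` is the portable spelling inside scripts.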


6. Assuming the backup works because a backup cron job exists

You find a cron job that runs pg_dump nightly. You check the "backups exist" box. Six months later, you need to restore. The backup file is 0 bytes because the cron job has been failing silently (database password changed, output going to /dev/null). Nobody noticed because nobody tested the restore.

Fix: A backup that hasn't been restored is not a backup — it's a hope. Test every backup you inherit: restore it to a scratch environment and verify the data. Check the cron job's error handling: does it alert on failure? Does it verify the backup size? Add both if missing.
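The "does it verify the backup size" part can be a few lines of shell. This sketch fabricates a tiny dump file in a temp directory so the failure path is visible; the directory, minimum size, and alert hook are all placeholders you'd replace with your real values and pager integration.

```shell
# Backup sanity check sketch: alert if last night's dump is missing
# or suspiciously small. BACKUP_DIR and MIN_BYTES are placeholders;
# here a deliberately tiny fake dump triggers the alert path.
BACKUP_DIR="$(mktemp -d)"              # in practice: /var/backups/pg
MIN_BYTES=1024                         # tune to your database's realistic floor
head -c 100 /dev/zero >"$BACKUP_DIR/db-$(date +%F).dump"   # simulate a failed dump

latest=$(ls -t "$BACKUP_DIR"/*.dump 2>/dev/null | head -1)
size=$(wc -c <"$latest" 2>/dev/null || echo 0)
if [ -z "$latest" ] || [ "$size" -lt "$MIN_BYTES" ]; then
  echo "ALERT: backup missing or too small ($size bytes)"   # hook your alerting here
else
  echo "OK: $latest ($size bytes)"
fi
```

A size check catches the 0-byte failure mode; it does not replace the restore test, which is the only proof the backup is usable.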


7. Upgrading a dependency without understanding the blast radius

You notice the server runs PostgreSQL 11 (EOL). You schedule an upgrade to PostgreSQL 16. The upgrade breaks three applications that use deprecated features, a monitoring integration that depends on pg_stat views that changed, and a backup script that uses deprecated pg_dump flags.

Fix: Before upgrading any shared dependency: inventory every consumer. Check each consumer's compatibility with the new version. Test the upgrade in a staging environment that mirrors production's full dependency graph. Upgrade in stages: compatibility mode first, then full migration.
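"Inventory every consumer" for a database usually starts with who is connected right now. As a sketch, the live version is one ss invocation on the database host; here a captured sample (fabricated addresses) stands in so the parsing is visible without a real server.

```shell
# Consumer inventory sketch for a PostgreSQL host. SAMPLE is fabricated
# ss output (client addr:port, server addr:5432). Live equivalent:
#   ss -tn state established '( sport = :5432 )'
SAMPLE='10.0.0.11:42110 10.0.0.5:5432
10.0.0.12:55210 10.0.0.5:5432
10.0.0.11:42984 10.0.0.5:5432
10.0.0.30:39001 10.0.0.5:5432'

# Unique client hosts currently talking to the database:
CONSUMERS=$(echo "$SAMPLE" | awk '{split($1, a, ":"); print a[1]}' | sort -u)
echo "$CONSUMERS"
```

Point-in-time connections miss batch jobs and monthly reports, so run the capture repeatedly over a full business cycle before calling the consumer list complete.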


8. Ignoring the weird scripts in /usr/local/bin

You find a dozen scripts in /usr/local/bin with names like fix-stuck-jobs.sh, restart-if-high-mem.sh, and clear-temp-tables.sh. They look hacky. You ignore them as technical debt. A week later, jobs are stuck, memory is high, and temp tables are filling the database. Those scripts were the immune system.

Fix: Treat every script in /usr/local/bin, /opt/scripts, and /root as documentation of a problem that was never properly fixed. Read each script. Understand what problem it solves. Document it. Then decide: automate it properly, or keep running the script until you can fix the root cause.

Under the hood: These scripts are the system's immune system. fix-stuck-jobs.sh means "jobs get stuck regularly." restart-if-high-mem.sh means "there is a memory leak nobody fixed." clear-temp-tables.sh means "the database accumulates temp data that is never cleaned up." Each script documents a failure mode. Removing the script does not fix the failure — it removes the workaround.
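A low-effort way to start that documentation pass is to harvest the leading comment block from every script, since the comments (when they exist) usually name the failure mode. The sample script below is fabricated in a temp directory; point the loop at /usr/local/bin and /opt/scripts on a real host.

```shell
# Harvest leading comments from workaround scripts as a first-draft
# inventory. SCRIPT_DIR is a temp stand-in for /usr/local/bin.
SCRIPT_DIR="$(mktemp -d)"
cat >"$SCRIPT_DIR/fix-stuck-jobs.sh" <<'EOF'
#!/bin/sh
# Requeues jobs stuck for more than an hour.
# Workaround for a scheduler deadlock nobody has root-caused.
EOF

for f in "$SCRIPT_DIR"/*.sh; do
  echo "== $(basename "$f") =="
  # print comment lines after the shebang, stop at first non-comment
  awk 'NR > 1 && /^#/ { print } NR > 1 && !/^#/ { exit }' "$f"
done
```

Scripts with no comments at all are the ones to read line by line first: they encode a failure mode and nothing else.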


9. Connecting your laptop to the production network to "test something"

You need to test connectivity to the production database. You connect your laptop directly to the production network. Your laptop has a DHCP server running (from a previous lab). It starts handing out IP addresses on the production network. Three servers lose their IP assignments.

Fix: Never connect untested devices to production networks. Use a jump box or bastion host. If you must test network connectivity, use a read-only method (ping, traceroute, telnet to a port) from an existing production host, not a personal device.
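If all you need is "can host A reach port B", bash can do a connect test from an existing production host with no extra tooling, via its /dev/tcp feature. This is a sketch; the host and port below are placeholders, and /dev/tcp is a bash feature, not available in plain sh.

```shell
# Minimal TCP reachability check using bash's /dev/tcp, runnable from
# a jump box or existing production host. Host/port are placeholders.
check_port() {
  host="$1"; port="$2"
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "open: $host:$port"
  else
    echo "closed/unreachable: $host:$port"
  fi
}

RESULT=$(check_port 127.0.0.1 1)   # port 1 is almost certainly closed
echo "$RESULT"
```

A connect test only opens and closes a socket; it sends no application traffic, which is about as read-only as network testing gets.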


10. Not documenting what you've learned

You spend three weeks reverse-engineering the system. You now understand the architecture, the dependencies, the cron jobs, and the failure modes. You hold this knowledge in your head. You get pulled to another project. Six months later, someone new inherits the system and starts from zero. You've become the tribal knowledge you were trying to eliminate.

Fix: Document as you discover. Not a polished wiki page — a living notes file. Update it every time you learn something. Include: system map, dependency list, cron job inventory, known issues, and "things that will bite you." The document doesn't need to be perfect. It needs to exist.

Remember: The "bus factor" for legacy systems is always 1. If you hold the knowledge in your head, you ARE the tribal knowledge. A rough NOTES.md with ## What I learned this week entries is infinitely more valuable than a polished wiki page that never gets written.
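A tiny helper can lower the friction of "document as you discover" to a single command. This is a sketch of the NOTES.md habit described above; the note() function, file location, and example entries are all illustrative.

```shell
# Sketch of a low-friction NOTES.md habit: one command per discovery.
# NOTES is a temp path here; in practice keep it in the repo root.
NOTES="$(mktemp -d)/NOTES.md"

note() {
  {
    echo "## What I learned this week ($(date +%F))"
    echo "- $*"
    echo
  } >>"$NOTES"
}

note "cron job 'sync-reports' feeds an executive dashboard; do not disable"
note "read replica at 10.0.0.6 serves six read-only services"
cat "$NOTES"
```

Dated, append-only entries beat a polished page precisely because appending takes five seconds, so it actually happens.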