The Zombie Cron Job

Category: The Incident | Domains: cron, linux-ops | Read time: ~5 min


Setting the Scene

I was an SRE at a mid-size e-learning platform — about 300 employees, 2 million active students. We'd been around for eight years, which in startup years is geological time. The codebase had layers like sedimentary rock. There were scripts nobody remembered writing, services nobody remembered deploying, and cron jobs nobody remembered scheduling. This story is about one of those cron jobs.

What Happened

Monday morning — The data team opens a ticket: "Student progress records are disappearing. Users report losing completed lessons. Seems intermittent — happens overnight, not during business hours."

Monday afternoon — I check the database. The lesson_completions table shows normal INSERT activity during the day. But at 3:15 AM UTC every night, there's a massive DELETE — about 40,000 rows removed. Then users re-complete lessons the next day, and the cycle repeats.

Tuesday — I search the application code for any DELETE on lesson_completions. Nothing. I check the deployment pipeline for any nightly job. Nothing. I search our Airflow DAGs. Nothing. Our cron monitoring tool (Cronitor) shows no job at 3:15 AM.

Tuesday evening — I SSH into the application servers and run crontab -l for every user. Nothing suspicious. But crontab -l only lists per-user crontabs, so I also check /etc/cron.d/, which holds system-wide job files. And there it is: a file called cleanup_stale_sessions, dated 2019. Inside:

15 3 * * * appuser /opt/scripts/cleanup_old_records.sh
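
The entry above uses the system crontab format found in /etc/crontab and /etc/cron.d/, which puts a user field between the schedule and the command; per-user crontabs omit that field. As a rough illustration of how such a line breaks down, here is a minimal Python sketch. The names CronEntry and parse_system_cron_line are hypothetical, and the parser only covers the common cases (five-field schedules and @-style shortcuts), not every cron dialect.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CronEntry:
    """One job from a system crontab file (hypothetical helper type)."""
    schedule: str
    user: str
    command: str


def parse_system_cron_line(line: str) -> Optional[CronEntry]:
    """Split a line in /etc/crontab or /etc/cron.d format into its parts.

    Returns None for blank lines, comments, environment assignments,
    and lines with too few fields.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None

    first = stripped.split(None, 1)[0]
    if "=" in first:
        # Environment assignment such as MAILTO=root, not a job.
        return None

    if first.startswith("@"):
        # Shortcut schedules such as @daily: schedule, user, command.
        fields = stripped.split(None, 2)
        if len(fields) < 3:
            return None
        return CronEntry(fields[0], fields[1], fields[2])

    # Five schedule fields, then the user, then the rest is the command.
    fields = stripped.split(None, 6)
    if len(fields) < 7:
        return None
    return CronEntry(" ".join(fields[:5]), fields[5], fields[6])
```

Fed the line from cleanup_stale_sessions, this yields the schedule "15 3 * * *" (03:15 every day), the user appuser, and the command /opt/scripts/cleanup_old_records.sh.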

Tuesday evening, continued — I read the script. It was written by a developer who left two years ago. Originally it cleaned up a stale_sessions table that no longer exists. At some point, someone modified the script to also clean up lesson_completions records older than 30 days — presumably thinking that was a temp table too. The modification wasn't in version control. The script lived in /opt/scripts/ on a single server, not managed by our config management.

Wednesday — I confirm the damage. The job has been running every night for approximately 14 months. Every student who completed a lesson more than 30 days ago had their completion record deleted. Students had to redo lessons, customer support had been fielding "lost progress" tickets for over a year, and nobody connected it to a cron job because the symptoms were intermittent and the deletions happened at 3 AM.

The Moment of Truth

Fourteen months. A script nobody remembered, running on a schedule nobody tracked, deleting data from a table that had changed purpose since the script was written. The scariest part: if the data team hadn't happened to look at the right table at the right time, it could have continued indefinitely.

The Aftermath

I removed the cron job and script immediately. We couldn't recover the deleted records: backup retention was 90 days and the deletions had started 14 months earlier, so no backup still held them. We ran a full audit of every cron job on every server and found 23 cron jobs across 8 servers, of which 6 were either broken, pointless, or actively harmful. We migrated all scheduled tasks to Airflow with mandatory documentation, alerting, and code review. And we added a weekly automated scan that diffs crontabs against a known-good inventory and alerts on any unknown jobs.
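
The idea behind that weekly scan can be sketched in a few lines of Python. This is not the scan we actually ran, only an illustration under stated assumptions: the inventory is a plain set of whitespace-normalized job lines, the function names cron_job_lines and unknown_jobs are hypothetical, and only a single directory such as /etc/cron.d is checked.

```python
from pathlib import Path
from typing import Dict, Set


def cron_job_lines(text: str) -> Set[str]:
    """Return the job lines of a cron file with whitespace collapsed.

    Blank lines, comments, and environment assignments are skipped.
    """
    jobs = set()
    for raw in text.splitlines():
        line = " ".join(raw.split())
        if not line or line.startswith("#"):
            continue
        if "=" in line.split(" ", 1)[0]:
            # e.g. MAILTO=ops or PATH=/usr/bin, not a job
            continue
        jobs.add(line)
    return jobs


def unknown_jobs(cron_dir: Path, inventory: Set[str]) -> Dict[str, Set[str]]:
    """Map each file in cron_dir to its job lines missing from the inventory.

    The inventory must hold job lines normalized the same way
    (single spaces between fields). Files with no unknown jobs are omitted.
    """
    findings = {}
    for path in sorted(cron_dir.iterdir()):
        if not path.is_file():
            continue
        extra = cron_job_lines(path.read_text()) - inventory
        if extra:
            findings[path.name] = extra
    return findings
```

A non-empty result from something like unknown_jobs(Path("/etc/cron.d"), inventory) is what should trigger the alert. In practice the inventory would more likely be generated from config management than maintained by hand.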

The Lessons

  1. Audit your cron jobs regularly: Unknown cron jobs are time bombs. Maintain an inventory, review it quarterly, and alert on any job that appears outside your config management.
  2. Every cron job needs an owner and documentation: If nobody can explain what a scheduled task does and why, it should be disabled (not deleted — disabled, so you can re-enable if something breaks).
  3. Use job schedulers with visibility: cron gives you no central inventory, run history, or failure alerting by default. Tools like Airflow or Rundeck, or even systemd timers with journald, give you logging and run history, and the fuller schedulers add alerting and a UI.

What I'd Do Differently

I'd implement a "cron census" as a standard part of onboarding to any new team or codebase. Before I write a single line of code, I'd inventory every scheduled task and understand what it does. I'd also enforce that all scheduled jobs run through config management: any cron entry, whether in a user crontab or in /etc/cron.d/, that isn't in Ansible/Puppet/Chef gets flagged and investigated automatically.
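
A first pass at such a census could look like the Python sketch below. It rests on assumptions: the path list covers the usual cron locations on Linux but varies by distribution, cron_census is a hypothetical name, and reading the per-user spool directories typically requires root. The root parameter exists only so the function can be pointed at a copied or mounted filesystem.

```python
from pathlib import Path
from typing import List

# Common cron locations on Linux, relative to the filesystem root.
# Assumption: exact paths differ between distributions.
CRON_FILES = ["etc/crontab"]
CRON_DIRS = [
    "etc/cron.d",
    "etc/cron.hourly",
    "etc/cron.daily",
    "etc/cron.weekly",
    "etc/cron.monthly",
    "var/spool/cron",           # per-user crontabs on RHEL-style systems
    "var/spool/cron/crontabs",  # per-user crontabs on Debian-style systems
]


def cron_census(root: Path = Path("/")) -> List[Path]:
    """List every regular file found in the known cron locations under root.

    Locations that do not exist are skipped. Unreadable directories raise
    PermissionError rather than being silently ignored.
    """
    found = []
    for rel in CRON_FILES:
        candidate = root / rel
        if candidate.is_file():
            found.append(candidate)
    for rel in CRON_DIRS:
        directory = root / rel
        if directory.is_dir():
            found.extend(p for p in sorted(directory.iterdir()) if p.is_file())
    return found
```

This only covers cron. A fuller census would also look at systemd timers (for example via systemctl list-timers --all) and at whatever scheduler the application itself uses.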

The Quote

"The scariest production incidents aren't the ones that wake you up at 3 AM. They're the ones that have been running at 3 AM for fourteen months without anyone noticing."

Cross-References