Cron & Job Scheduling - Street-Level Ops

Quick Diagnosis Commands

When a scheduled job is not running or misbehaving:

# 1. Is cron running?
systemctl status cron        # Debian/Ubuntu
systemctl status crond       # RHEL/CentOS

# 2. List the user's crontab
crontab -l
crontab -l -u deploy

# 3. Check system cron jobs
ls -la /etc/cron.d/
cat /etc/crontab

# 4. Check cron logs
grep CRON /var/log/syslog         # Debian/Ubuntu
grep CRON /var/log/cron           # RHEL/CentOS
journalctl -u cron --since "1 hour ago"   # use -u crond on RHEL/CentOS

# 5. Check if the job ran (look for recent entries)
grep 'backup' /var/log/syslog | tail -20

# 6. List all systemd timers
systemctl list-timers --all

# 7. Check a specific timer
systemctl status backup.timer
systemctl status backup.service

# 8. View the last run output
journalctl -u backup.service --since "24 hours ago" --no-pager

Pattern: Debugging Why a Cron Job Did Not Run

Work through this checklist systematically:

# 1. Is the cron daemon running?
systemctl is-active cron

# 2. Does the crontab entry exist?
crontab -l | grep backup

# 3. Is the schedule correct? (use crontab.guru to verify)
# E.g., "0 2 * * *" = 2:00 AM daily, NOT 2:00 PM

# 4. Does the command work when run manually?
sudo -u deploy /usr/local/bin/backup.sh

# 5. Is the PATH set correctly in cron?
# Add this to debug:
# * * * * * env > /tmp/cron-env.txt
# Then compare with your login shell env

# 6. Does the script have execute permission?
ls -la /usr/local/bin/backup.sh

# 7. Is the script using bash features but cron runs /bin/sh?
head -1 /usr/local/bin/backup.sh
# Should be: #!/bin/bash (not #!/bin/sh if using bash features)

Remember the cron debug mnemonic: P-E-S-P -- PATH set? Execute permission? SHELL correct? Path to the binary absolute? Work through these four and you will find the cause of most cron failures.

# 8. Check if output is being swallowed
# Temporarily add output redirection:
# 30 2 * * * /usr/local/bin/backup.sh >> /tmp/backup-debug.log 2>&1

# 9. Check if the job is being blocked by flock
ls -la /var/lock/*.lock
# Note: flock locks are advisory; the file existing does not mean the
# lock is held. Check who actually has it open:
fuser -v /var/lock/backup.lock

# 10. Check cron allow/deny files
cat /etc/cron.allow 2>/dev/null
cat /etc/cron.deny 2>/dev/null
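The P-E-S-P checks can be scripted. A minimal sketch (the check_cron_script function and paths are illustrative, not a standard tool):

```shell
#!/bin/bash
# sketch: automate the P-E-S-P checks for a script you run from cron
check_cron_script() {
  local script=$1
  # P: the cron entry should reference the script by absolute path
  [[ $script = /* ]] || echo "WARN: path is not absolute"
  # E: execute permission
  [ -x "$script" ] || echo "WARN: not executable"
  # S: shebang should match the features used (cron may default to /bin/sh)
  head -n1 "$script" | grep -q '^#!/bin/bash' || echo "NOTE: not a bash shebang"
  # P: cron's PATH is minimal; warn if the script never sets its own
  grep -q '^PATH=' "$script" || echo "NOTE: script does not set PATH itself"
}

# demo against a throwaway script
tmp=$(mktemp)
printf '#!/bin/sh\necho hi\n' > "$tmp"
check_cron_script "$tmp"
rm -f "$tmp"
```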

Gotcha: The Cron Environment Is Not Your Shell

This causes 90% of cron failures. Your script works perfectly when you run it manually but fails in cron.

# What you think cron runs:
PATH=/usr/local/bin:/usr/bin:/bin:/home/user/go/bin
HOME=/home/user
SHELL=/bin/bash
# Plus all your .bashrc exports

# What cron actually runs:
PATH=/usr/bin:/bin
HOME=/home/user
SHELL=/bin/sh
# Nothing else

# Debug it: add this temporarily to your crontab
* * * * * env > /tmp/cron-env.txt 2>&1
# Wait 1 minute, then compare:
diff <(env | sort) <(sort /tmp/cron-env.txt)
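You can also reproduce cron's sparse environment immediately with env -i, instead of waiting a minute for the test entry to fire:

```shell
# Run your command under an environment as minimal as cron's.
# If it fails here, it will fail under cron for the same reason.
env -i HOME="$HOME" SHELL=/bin/sh PATH=/usr/bin:/bin \
  /bin/sh -c 'echo "PATH=$PATH"'
# Replace the echo with your actual script, e.g.:
# env -i HOME="$HOME" SHELL=/bin/sh PATH=/usr/bin:/bin /usr/local/bin/backup.sh
```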

Fixes

# Fix 1: Set PATH at the top of crontab
PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/go/bin
30 2 * * * backup.sh

# Fix 2: Use absolute paths everywhere
30 2 * * * /usr/local/bin/python3 /opt/scripts/backup.py

# Fix 3: Wrapper script that sources the environment
#        (beware: many stock .bashrc files return early when the shell
#        is non-interactive, so a dedicated env file is more reliable)
#!/bin/bash
source /home/deploy/.bashrc
/opt/scripts/backup.sh

Pattern: Systemd Timer Status Check

# Show all timers sorted by next trigger
systemctl list-timers --all

# Output:
# NEXT                         LEFT          LAST                         PASSED       UNIT
# Sat 2024-03-16 02:30:00 UTC  12h left      Fri 2024-03-15 02:30:00 UTC  11h ago      backup.timer
# Sat 2024-03-16 00:00:00 UTC  10h left      Fri 2024-03-15 00:00:00 UTC  14h ago      cleanup.timer

# Check if a timer missed its schedule
systemctl status backup.timer
# Look for "Trigger:" line

# Validate OnCalendar expressions
systemd-analyze calendar "Mon..Fri *-*-* 09:00:00"
# Shows the next trigger time; add --iterations=5 for the next five

# View journal for a timer service
journalctl -u backup.service -n 50

# Manually trigger the service (test without waiting)
systemctl start backup.service
journalctl -u backup.service -f
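For reference, a minimal timer/service pair looks like this (unit names, paths, and the 02:30 schedule are examples, matching the backup.timer/backup.service naming used above):

```ini
# /etc/systemd/system/backup.service
[Unit]
Description=Nightly backup

[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh

# /etc/systemd/system/backup.timer
[Unit]
Description=Run backup nightly

[Timer]
OnCalendar=*-*-* 02:30:00
Persistent=true   # run once at boot if a trigger was missed while powered off

[Install]
WantedBy=timers.target
```

Enable with systemctl enable --now backup.timer. Persistent=true covers missed runs during downtime, something classic cron needs anacron for.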

Gotcha: Cron Jobs Overlapping

Your job runs every 5 minutes but sometimes takes 15 minutes. Now you have 3 copies running simultaneously, fighting over the same resources.

# Detect overlapping jobs
ps aux | grep backup | grep -v grep

# If you see multiple instances, you have a problem

# Fix with flock (file locking)
*/5 * * * * flock -n /var/lock/backup.lock /usr/local/bin/backup.sh
# -n = non-blocking: if lock exists, exit immediately

# Fix with flock + timeout
*/5 * * * * flock -w 60 /var/lock/backup.lock /usr/local/bin/backup.sh
# -w 60 = wait up to 60 seconds for the lock

# Verify the lock mechanism works
flock -n /var/lock/backup.lock echo "got lock"
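Taking the lock inside the script itself also protects runs started by hand, not just the cron entry. A sketch (the lock path is an example):

```shell
#!/bin/bash
# hold an exclusive lock for the lifetime of the script
LOCKFILE=${LOCKFILE:-/tmp/backup-demo.lock}
exec 200>"$LOCKFILE"            # open fd 200 on the lock file
if ! flock -n 200; then         # non-blocking: bail out if another instance holds it
  echo "another instance is running; exiting"
  exit 0
fi
echo "got lock; doing work"
# the lock is released automatically when the script (and fd 200) exits
```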

Gotcha: DST (Daylight Saving Time) and Cron

Classic cron uses the system's local timezone. When DST changes:

Spring forward (2 AM -> 3 AM):
  Jobs scheduled between 2:00-2:59 AM DO NOT RUN

Fall back (2 AM -> 1 AM):
  Jobs scheduled between 1:00-1:59 AM RUN TWICE

# Check your system timezone
timedatectl

# Fix 1: Run critical jobs in UTC
TZ=UTC
30 2 * * * /usr/local/bin/backup.sh

# Fix 2: Use systemd timers (handle DST better)
# OnCalendar uses the system timezone but handles transitions

# Fix 3: Schedule at times not near DST transitions
# Use 3:30 AM or 4:00 AM instead of 2:00 AM

# Fix 4: Make jobs idempotent so running twice is safe
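Fix 4 can be as simple as a per-day stamp file, so the fall-back duplicate run becomes a no-op (function name and stamp directory are illustrative):

```shell
#!/bin/bash
# run the real work at most once per calendar day
run_backup() {
  local stampdir=$1
  local stamp="$stampdir/backup.$(date +%F)"   # one stamp per day
  if [ -e "$stamp" ]; then
    echo "already ran today; skipping"
    return 0
  fi
  touch "$stamp"
  echo "running backup"                        # real work goes here
}

dir=$(mktemp -d)
run_backup "$dir"   # first run does the work
run_backup "$dir"   # a DST double-run is harmless
```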

Pattern: Kubernetes CronJob Monitoring

# List all CronJobs
kubectl get cronjobs -A

# Check last schedule time
kubectl get cronjob database-backup -o jsonpath='{.status.lastScheduleTime}'

# List recent Jobs created by the CronJob
kubectl get jobs --sort-by=.metadata.creationTimestamp | grep database-backup | tail -10

# Check failed jobs
kubectl get jobs --field-selector status.successful=0 | grep database-backup

# View logs from the most recent job
kubectl logs job/$(kubectl get jobs --sort-by=.metadata.creationTimestamp \
  -l app=database-backup -o jsonpath='{.items[-1:].metadata.name}')

# Describe CronJob for events and conditions
kubectl describe cronjob database-backup

# Check if jobs are piling up (concurrencyPolicy issue)
kubectl get pods -l job-name --field-selector=status.phase=Running

# Manually trigger a CronJob (create a job from the CronJob template)
kubectl create job --from=cronjob/database-backup manual-backup-test

Gotcha: Kubernetes CronJob Timezone

Before Kubernetes 1.27, CronJobs used the kube-controller-manager timezone (usually UTC). If you set schedule: "0 9 * * *" expecting 9 AM local time but the cluster runs UTC, the job runs at 9 AM UTC.

# K8s 1.27+: use timeZone field
spec:
  schedule: "0 9 * * *"
  timeZone: "America/New_York"

# Pre-1.27: calculate the UTC offset yourself
# 9 AM EST = 14:00 UTC
spec:
  schedule: "0 14 * * *"

Pattern: CronJob History and Cleanup

# Keep the last 3 successful and 5 failed jobs
spec:
  successfulJobsHistoryLimit: 3   # default: 3
  failedJobsHistoryLimit: 5       # default: 1 -- raise it to keep evidence for debugging

# Check whether finished jobs are piling up
kubectl get jobs | wc -l
# If this returns hundreds, you have a cleanup problem

# Manual cleanup of completed jobs
kubectl delete jobs --field-selector status.successful=1

# Clean up failed jobs older than 1 day
kubectl get jobs --field-selector=status.successful=0 -o json | \
  jq -r '.items[] | select(.status.startTime < (now - 86400 | todate)) | .metadata.name' | \
  xargs kubectl delete job
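If the cluster runs Kubernetes 1.23 or newer, the TTL-after-finished controller can delete finished Jobs automatically instead of relying on manual cleanup; a sketch:

```yaml
# inside the CronJob spec
spec:
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 86400   # delete each Job 24h after it finishes
```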

Pattern: at and batch for One-Time Jobs

# Schedule a one-time job
echo "/usr/local/bin/migration.sh" | at 02:00 AM tomorrow

# Schedule relative to now
echo "/usr/local/bin/task.sh" | at now + 2 hours

# List pending at jobs
atq

# View a specific job
at -c 42

# Remove a pending job
atrm 42

# batch: run when system load drops below 1.5
echo "/usr/local/bin/heavy-task.sh" | batch

Gotcha: MAILTO and Silent Failures

By default, cron emails the output of every job to the user. If mail is not configured (which is common on modern servers), the output is silently discarded — including error messages.

# Check if mail is working
echo "test" | mail -s "cron test" ops@example.com

# If mail is not configured, redirect output explicitly
30 2 * * * /usr/local/bin/backup.sh >> /var/log/backup.log 2>&1

# Or set MAILTO to empty to suppress mail, and handle logging yourself
MAILTO=""
30 2 * * * /usr/local/bin/backup.sh >> /var/log/backup.log 2>&1

# Never redirect to /dev/null unless you truly do not care about failures
# This hides ALL errors:
30 2 * * * /usr/local/bin/backup.sh > /dev/null 2>&1

War story: A backup cron job was redirecting to /dev/null for three years. The backup script had started failing silently six months ago due to an expired credential. No one noticed until a disk failure required a restore -- and the most recent valid backup was six months old. Always log output to a file and monitor the log for staleness.
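A middle ground between mail and /dev/null is a wrapper that logs everything but only emits output (and therefore cron mail) on failure, similar in spirit to moreutils' chronic. A sketch (function and log names are illustrative):

```shell
#!/bin/bash
# log every run; print the captured output only when the command fails
run_logged() {
  local log=$1; shift
  local out rc
  out=$("$@" 2>&1); rc=$?
  printf '%s rc=%s cmd=%s\n' "$(date)" "$rc" "$*" >> "$log"
  [ "$rc" -ne 0 ] && printf '%s\n' "$out"    # cron mails only this
  return "$rc"
}

log=$(mktemp)
run_logged "$log" true                        # success: silent
run_logged "$log" sh -c 'echo boom; exit 1'   # failure: prints "boom"
echo "runs logged: $(grep -c 'rc=' "$log")"
```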


Pattern: Monitoring Scheduled Jobs

Jobs need monitoring like any other service. A backup that silently fails for a week is not a backup.

# Simple: check for recent output in the log
[ -f /var/log/backup.log ] && \
  find /var/log/backup.log -mmin -1440 -print | grep -q . || \
  echo "ALERT: backup log not updated in 24 hours"

# Better: use a dead man's switch / heartbeat service.
# At the end of your script, ping a monitoring endpoint:
curl -fsS --retry 3 https://hc-ping.com/your-uuid-here

Services that do this:

- Healthchecks.io
- Cronitor
- Dead Man's Snitch
- PagerDuty heartbeat checks

The monitoring service alerts you if the ping does not arrive on schedule. This catches both job failures AND jobs that never start.
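Healthchecks.io-style endpoints also accept a /fail suffix to report an explicit failure rather than just silence; other services use different conventions. A sketch of the selection logic (the UUID URL is a placeholder):

```shell
#!/bin/bash
# pick the heartbeat endpoint from the job's exit code
heartbeat_url() {
  local base=$1 rc=$2
  if [ "$rc" -eq 0 ]; then
    echo "$base"          # success ping
  else
    echo "$base/fail"     # explicit failure ping
  fi
}

heartbeat_url https://hc-ping.com/your-uuid-here 0
heartbeat_url https://hc-ping.com/your-uuid-here 1
# in the real job:
#   /usr/local/bin/backup.sh; rc=$?
#   curl -fsS --retry 3 "$(heartbeat_url https://hc-ping.com/your-uuid-here $rc)"
```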