
Incident Replay: systemd Service Flapping

Setup

  • System context: Production worker service managed by systemd. The service starts, runs for 30-60 seconds, crashes, and systemd restarts it. This cycle has been repeating for 2 hours.
  • Time: Saturday 07:00 UTC
  • Your role: On-call SRE

Round 1: Alert Fires

[Pressure cue: "Worker service on worker-prod-03 is flapping — restarting every minute. Job queue is backing up. Downstream services are receiving partial results."]

What you see: systemctl status worker shows "Active: activating (auto-restart)" with 127 restart attempts. journalctl -u worker --since '1 hour ago' shows the service starting, processing a few jobs, then segfaulting.

Choose your action:

  • A) Disable the service auto-restart to stop the flapping
  • B) Check the segfault details in the logs and core dumps
  • C) Increase the restart limit to give the service more chances
  • D) Roll back to the previous version of the worker binary

[Result: coredumpctl list shows 127 core dumps. coredumpctl info on the latest shows the segfault occurs in a JSON parsing function. The crash is triggered by a specific malformed message in the job queue. Every time the service restarts, it picks up the same poison message and crashes. Proceed to Round 2.]
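The crash loop can be sketched in miniature: a worker with no skip logic acks only after a successful parse, so every restart re-reads the same poison message. This is a Python simulation with illustrative names, not the real worker (which is presumably native code, given the segfault):

```python
import json

# Simulated queue: index 2 is the malformed "poison" message.
QUEUE = ['{"job": 1}', '{"job": 2}', '{"job": 3', '{"job": 4}']

def run_worker(queue, processed):
    """A worker with no skip logic: it always reads the head message and
    acks (pops) only after a successful parse."""
    while queue:
        job = json.loads(queue[0])   # raises on malformed input -> "crash"
        processed.append(job)
        queue.pop(0)                 # ack only after success

processed, restarts = [], 0
queue = list(QUEUE)
while queue:
    try:
        run_worker(queue, processed)
    except json.JSONDecodeError:
        restarts += 1                # "systemd restart"; queue head unchanged
        if restarts >= 5:
            break                    # stand-in for a restart-burst cap
```

After the first crash the head of the queue never changes, so every subsequent restart fails immediately: only the two jobs ahead of the poison message are ever processed.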

If you chose A:

[Result: Service stops flapping but is now completely down. The job queue continues to back up.]

If you chose C:

[Result: Already at 127 restarts. More restarts just mean more core dumps consuming disk space.]

If you chose D:

[Result: Previous version may also crash on the same poison message if the parsing bug existed before.]

Round 2: First Triage Data

[Pressure cue: "Poison message identified. The service crashes on every restart because it re-reads the same message."]

What you see: The worker service reads from a message queue (RabbitMQ). Message at position 347 is a malformed JSON that triggers a buffer overflow in the parser. The service has no dead-letter queue and no message skip/retry limit. It reads the message, crashes, and on restart reads it again.

Choose your action:

  • A) Manually remove the poison message from the queue
  • B) Add a dead-letter queue configuration and requeue the message there
  • C) Skip the message using the RabbitMQ management API and restart the service
  • D) Fix the JSON parser to handle malformed input

[Result: Using the RabbitMQ management API, you reject the poison message and move it to a dead-letter queue for later analysis. Service restarts cleanly and begins processing the queue normally. Proceed to Round 3.]

If you chose A:

[Result: Deleting the message loses it entirely. If it contained important data, it is gone.]

If you chose B:

[Result: DLQ configuration is correct long-term but requires a config change and service redeploy. Not the quickest path during an incident.]

If you chose D:

[Result: Parser fix is the right root cause fix but requires development, testing, and deployment. Not an incident-time fix.]

Round 3: Root Cause Identification

[Pressure cue: "Service processing normally. Fix the underlying issues."]

What you see: Root cause: (1) JSON parser has a buffer overflow on malformed input — it crashes instead of returning an error. (2) No dead-letter queue configured — poison messages are retried infinitely. (3) systemd restart policy has no rate limiting — the service flaps indefinitely without intervention.

Choose your action:

  • A) Fix the parser, add DLQ, and add systemd restart rate limiting
  • B) Just fix the parser — it should handle bad input
  • C) Just add DLQ — it catches poison messages regardless of the parser
  • D) All three plus add a circuit breaker pattern

[Result: Parser fixed to return errors on bad input. DLQ configured with a 3-retry limit. systemd StartLimitIntervalSec=300 and StartLimitBurst=5 added. Circuit breaker pattern added in the service code. Proceed to Round 4.]
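The systemd rate limiting could be expressed as a drop-in like the following (a sketch; the unit name worker and the thresholds come from the scenario, the restart settings are assumptions):

```ini
# /etc/systemd/system/worker.service.d/restart-limit.conf
[Unit]
# Stop restarting if the service fails 5 times within 300 seconds;
# the unit then enters the failed state instead of flapping forever.
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=5
```

A systemctl daemon-reload is needed for the drop-in to take effect.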

If you chose A:

[Result: Good set of fixes but the circuit breaker adds resilience at the application level too.]

If you chose B:

[Result: Parser fix prevents this specific crash but other bad inputs could cause different crashes.]
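In Python terms, the defensive parser the fix describes returns an error value instead of crashing (a sketch; the actual worker is presumably native code, given the segfault):

```python
import json

def parse_job(raw):
    """Parse a job payload, returning (job, None) on success or
    (None, error) on malformed input instead of crashing."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        return None, f"malformed payload: {exc}"
```

The caller can then log the error and hand the message to the DLQ path rather than taking the whole process down.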

If you chose C:

[Result: DLQ prevents infinite retry but the crash still happens — just less visibly.]
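The circuit breaker mentioned in the Round 3 result could be a small stateful guard like this (minimal Python sketch; the class and parameter names are hypothetical):

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows a half-open
    probe again once `cooldown` seconds have passed."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock           # injectable for testing
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown  # half-open

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

Wrapped around message consumption, this lets the service stop pulling work and report unhealthy after repeated failures instead of crashing and relying on systemd to restart it.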

Round 4: Remediation

[Pressure cue: "Service stable. Deploy fixes and close."]

Actions:

  1. Verify service is running stably: systemctl status worker
  2. Verify job queue is draining: check queue depth in RabbitMQ dashboard
  3. Deploy parser fix and DLQ configuration
  4. Add systemd restart rate limiting to the unit file
  5. Analyze the poison message to understand how it was generated
  6. Add message validation at the producer side
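Producer-side validation (action 6) can be as simple as refusing to publish anything that does not serialize as JSON. A hypothetical helper; the real publish call depends on the client library in use:

```python
import json

def publish_job(publish, payload):
    """Validate that payload serializes as JSON before handing it to the
    broker; `publish` stands in for the client library's publish call."""
    try:
        body = json.dumps(payload)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"refusing to publish malformed job: {exc}") from exc
    publish(body)
```

This does not replace the consumer-side fixes, but it keeps the next malformed message from entering the queue in the first place.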

Damage Report

  • Total downtime: 0 (service was up intermittently; jobs were queued)
  • Blast radius: Job processing delayed by 2 hours; downstream services received partial results
  • Optimal resolution time: 10 minutes (identify poison message -> skip it -> service recovers)
  • If every wrong choice was made: 4+ hours of flapping with growing queue backlog

Cross-References