Incident Triage Footguns
Mistakes that extend outages, erode trust, or make the same incident happen again.
1. Working silently without communicating
You spend 45 minutes heads-down debugging. Nobody knows what you have tried. Stakeholders assume nothing is happening. A second engineer starts from scratch because they do not know you are already on it. Meanwhile, the status page still says everything is fine.
Fix: Post an update within 10 minutes of acknowledging. Update every 15 minutes during SEV-1/2. Even "still investigating, no new findings" is better than silence.
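The cadence above can be enforced mechanically. A minimal sketch, assuming a `post_update()` hook you would wire to your chat or incident tool (here it just prints), and treating the SEV-3/4 intervals as illustrative:

```python
import threading

# Update cadence per severity, in minutes. SEV-1/2 follows the
# 15-minute guidance above; SEV-3/4 values are assumptions.
CADENCE_MINUTES = {1: 15, 2: 15, 3: 60, 4: 240}

def post_update(message: str) -> None:
    # Placeholder: wire this to Slack, PagerDuty notes, etc.
    print(message)

def start_update_reminder(severity: int, stop: threading.Event) -> threading.Thread:
    """Nag the incident commander on cadence until `stop` is set."""
    interval_seconds = CADENCE_MINUTES[severity] * 60

    def loop() -> None:
        # Event.wait returns False on timeout (time to nag again)
        # and True once `stop` is set (incident resolved).
        while not stop.wait(interval_seconds):
            post_update("Reminder: post an incident update. Even "
                        "'still investigating, no new findings' counts.")

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```

Setting the `stop` event when the incident closes shuts the reminder down cleanly.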
2. Declaring root cause too early
Five minutes in, you see a database error and announce "the database is the root cause." Everyone pivots to the database. Thirty minutes later you discover the database error was a symptom of a network partition. You wasted the DBA's time and delayed the real fix.
Fix: Say "we are seeing database errors" (symptom), not "the database is the root cause" (conclusion). Root cause comes from the postmortem, not from the first 10 minutes of investigation.
3. Hero mode instead of escalating
You have been debugging alone for 30 minutes on a SEV-1. You think you are close. You do not want to wake up the on-call DBA. Another 30 minutes passes. The outage is now an hour long. The DBA could have fixed it in 5 minutes because they have seen this before.
Fix: Set a hard timer: if you have not resolved a SEV-1 in 15 minutes, escalate. Escalation is not admitting failure; it is getting the right expertise to the right problem. At 3 AM, two heads are almost always faster than one.
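The hard timer reduces to a pure time comparison, which makes it easy to hook into a bot or a dashboard. A sketch, using the 15-minute SEV-1 deadline above (the SEV-2 threshold is an assumption):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

ESCALATION_DEADLINE = {
    1: timedelta(minutes=15),  # from the guidance above
    2: timedelta(minutes=30),  # illustrative
}

def should_escalate(severity: int, acknowledged_at: datetime,
                    now: Optional[datetime] = None) -> bool:
    """True once an unresolved incident is past its escalation deadline."""
    now = now or datetime.now(timezone.utc)
    deadline = ESCALATION_DEADLINE.get(severity)
    return deadline is not None and now - acknowledged_at >= deadline
```

Checking this on every monitoring tick removes the "I think I'm close" judgment call from the loop.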
4. Tunnel vision on a single hypothesis
You see a memory spike and spend 20 minutes investigating a memory leak. You ignore the CPU graph showing 100% utilization, the deploy that happened 5 minutes before the incident, and the fact that only one region is affected. The actual cause was a bad config push.
Fix: Before deep-diving, check all dimensions: recent changes, affected scope, multiple metrics. Use a checklist approach. If your hypothesis does not explain all symptoms, it is probably wrong.
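The checklist can live in code so it is the same every time. A sketch; the dimension names are illustrative, not a standard:

```python
from typing import List, Set

# Breadth-first triage dimensions to sweep before committing to
# any single hypothesis.
TRIAGE_DIMENSIONS = [
    "recent deploys or config pushes",
    "affected scope (one region? one customer? everything?)",
    "CPU / memory / disk / network metrics, not just the first spike",
    "error logs across services",
    "upstream and downstream dependencies",
]

def unchecked_dimensions(checked: Set[str]) -> List[str]:
    """Return the dimensions not yet ruled in or out."""
    return [d for d in TRIAGE_DIMENSIONS if d not in checked]
```

Deep-dive only once `unchecked_dimensions` is empty, or once one dimension explains all the symptoms.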
5. Not verifying the alert is real
An alert fires: "Error rate > 5%." You page the team, open an incident channel, update the status page. Then you discover it was a monitoring glitch — a scrape failure caused a data gap that looked like errors. You burned trust and caused alert fatigue.
Fix: Spend 60 seconds verifying from multiple sources before escalating. Check the service health endpoint directly. Check a different monitoring tool. Confirm with a manual test.
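One way to structure the 60-second check is a majority vote across independent sources. A sketch, where the check callables are placeholders for a direct health-endpoint probe, a second monitoring tool, and a manual test:

```python
from typing import Callable, List

def alert_is_real(checks: List[Callable[[], bool]]) -> bool:
    """Majority vote across independent signal sources."""
    results = []
    for check in checks:
        try:
            results.append(check())
        except Exception:
            # A probe that errors out confirms nothing either way;
            # whether to count it as evidence depends on your setup.
            results.append(False)
    return sum(results) > len(results) / 2
```

A single monitoring glitch (one source reporting errors, two healthy) then fails the vote instead of triggering a full incident.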
6. Skipping the rollback
A deployment went out 20 minutes ago. Errors spike 15 minutes later. Instead of rolling back, you try to fix forward — deploying a hotfix. The hotfix has its own bug. Now you are two bad deploys deep and further from recovery.
Fix: When a recent deployment correlates with an incident, roll back first and investigate later. A rollback is usually the fastest, lowest-risk path to recovery (irreversible schema migrations are the notable exception); a hotfix written under pressure introduces new risk. Get to a known-good state, then debug.
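"Correlates with" can be made concrete as a time window between deploy and error spike. A sketch; the 30-minute window is an assumption to tune to your deploy frequency:

```python
from datetime import datetime, timedelta

# How soon after a deploy an error spike counts as "correlated".
# Illustrative value; tune to your environment.
CORRELATION_WINDOW = timedelta(minutes=30)

def recommend_rollback(deployed_at: datetime,
                       errors_started_at: datetime) -> bool:
    """True when the error spike plausibly correlates with the deploy."""
    delta = errors_started_at - deployed_at
    # Errors must start at or after the deploy, and within the window.
    return timedelta(0) <= delta <= CORRELATION_WINDOW
```

In the scenario above (deploy at T, errors at T+15m) this says roll back, not hotfix.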
7. No severity classification
Every incident gets the same response: one engineer looking at it when they have time. A SEV-4 cosmetic issue gets the same attention as a SEV-1 total outage. Real emergencies queue behind noise.
Fix: Classify severity within the first 2 minutes using a decision tree. Different severities get different response levels: SEV-1 gets all-hands immediately, SEV-4 waits until business hours. Publish your classification criteria so everyone agrees.
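A 2-minute decision tree is small enough to encode directly. A sketch whose questions mirror the guidance above; the exact criteria are illustrative, and you should publish your own:

```python
def classify_severity(total_outage: bool,
                      customer_impact: bool,
                      degraded: bool) -> int:
    """Walk the decision tree top-down; first match wins."""
    if total_outage:
        return 1  # all-hands immediately
    if customer_impact:
        return 2  # page on-call now
    if degraded:
        return 3  # address soon, during working hours
    return 4      # cosmetic: waits until business hours
```

Because the tree is ordered, a total outage that is also a customer impact still classifies as SEV-1, never lower.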
8. No postmortem on recurring incidents
The same incident happens three times in two months. Each time it is resolved quickly because the team "knows how to fix it." Nobody writes a postmortem. Nobody addresses the underlying cause. The fourth occurrence happens during a holiday weekend when the expert is unavailable.
Fix: Conduct a blameless postmortem for every SEV-1/2 within 48 hours. Track action items to completion. If the same incident recurs, the postmortem's action items were not sufficient — tighten them.
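Recurrence is detectable if incidents carry a fingerprint (a simple string key here, which is an assumption; real systems dedupe more carefully). A sketch that flags repeat incidents lacking a completed postmortem:

```python
from datetime import datetime, timedelta
from typing import List, Optional

# Window in which a repeat of the same incident counts as recurring.
# Illustrative: matches the "three times in two months" scenario above.
RECURRENCE_WINDOW = timedelta(days=60)

def needs_postmortem_escalation(occurrences: List[datetime],
                                postmortem_done: bool,
                                now: Optional[datetime] = None) -> bool:
    """True when the same incident recurred in-window with no postmortem."""
    if postmortem_done or not occurrences:
        return False
    now = now or max(occurrences)
    recent = [t for t in occurrences if now - t <= RECURRENCE_WINDOW]
    return len(recent) >= 2
```

Running this over the incident log turns "we keep fixing this" from folklore into an actionable flag.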
9. Forgetting to update the status page
The incident is resolved at 2:15 AM. You go back to sleep. The status page still shows "Major Outage" at 9 AM. Support is flooded with tickets. Customers think the service is still down.
Fix: Status page updates are part of the resolution checklist. Before closing an incident: update status page, post final update in incident channel, notify stakeholders. Automate status page updates if possible.
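The checklist can be a hard gate rather than a habit: the incident cannot close while steps are missing. A sketch; the step names are illustrative:

```python
from typing import List, Set, Tuple

# Close-out steps from the resolution checklist above.
RESOLUTION_STEPS = [
    "status page updated to resolved",
    "final update posted in incident channel",
    "stakeholders notified",
]

def can_close_incident(done: Set[str]) -> Tuple[bool, List[str]]:
    """Return (ok_to_close, missing_steps)."""
    missing = [step for step in RESOLUTION_STEPS if step not in done]
    return (not missing, missing)
```

At 2:15 AM, an incident bot that refuses to close until `missing` is empty is more reliable than a sleepy engineer's memory.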
10. Making changes without a rollback plan
You try a fix during the incident: restarting a database, changing a config, scaling down a service. The fix makes things worse. You do not remember the original config values. Now you are fighting two problems.
Fix: Before making any change during an incident, note what the current state is and how to revert. Use kubectl rollout undo, git revert, or write down the original config value. Every action during an incident should be reversible.
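"Note the current state and how to revert" can be wrapped into the change itself. A minimal sketch, where a plain dict stands in for whatever store you actually edit (a ConfigMap, a feature-flag service, etc.):

```python
import copy
from contextlib import contextmanager

@contextmanager
def reversible_change(config: dict):
    """Snapshot state before a risky change; restore it if the change fails."""
    snapshot = copy.deepcopy(config)   # the rollback plan, captured first
    try:
        yield config
    except Exception:
        config.clear()
        config.update(snapshot)        # back to the known-good state
        raise
```

If the change "makes things worse" and raises, the original values are restored instead of lost; you are fighting one problem, not two.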