Anti-Primer: Binary And Floats

Everything that can go wrong, will — and in this story, it does.

The Setup

A team is working with binary and floats in a production environment under tight deadlines. The senior engineer who designed the system is on leave, and the remaining team is implementing changes based on incomplete documentation. Pressure to deliver is overriding caution.

The Timeline

Hour 0: Skipping the Documentation

The engineer jumps straight to implementation without reading the existing documentation or runbooks. The deadline is looming, and this seems like the fastest path forward. The result: the engineer repeats a known mistake that the documentation explicitly warns against, and 4 hours are wasted.

Footgun #1: Skipping the Documentation. Jumping straight to implementation without reading the existing documentation or runbooks repeats a known mistake that the documentation explicitly warns against and wastes 4 hours.

Nobody notices yet. The engineer moves on to the next task.

Hour 1: No Rollback Plan

The team makes changes with no clear way to reverse them if things go wrong. Under time pressure, it chooses speed over caution. The result: the change breaks production, and rolling back requires 6 hours of manual work instead of one command.

Footgun #2: No Rollback Plan. Making changes with no clear way to reverse them means that when the change breaks production, rolling back requires 6 hours of manual work instead of one command.

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Testing in Production

The team skips the staging environment because 'it is not an exact replica anyway'. Nobody pushes back because the shortcut looks harmless in the moment. The result: a configuration error that staging would have caught causes a 2-hour production outage.

Footgun #3: Testing in Production. Skipping the staging environment because 'it is not an exact replica anyway' lets a configuration error that staging would have caught cause a 2-hour production outage.

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: Single Point of Knowledge

Only one person understands the system, and there is no cross-training or documentation. The team has gotten away with similar shortcuts before, so nobody raises a flag. The result: that person is unavailable during an incident, and the remaining team takes 5x longer to resolve it.

Footgun #4: Single Point of Knowledge. Only one person understands the system, with no cross-training or documentation, so when that person is unavailable during an incident the remaining team takes 5x longer to resolve it.

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---------|-------------|------------------------------|
| 1 | Skipping the Documentation | Repeats a known mistake that the documentation explicitly warns against; 4 hours wasted | Primer: Read the primer and runbooks before making changes; past lessons are documented for a reason |
| 2 | No Rollback Plan | The change breaks production; rolling back requires 6 hours of manual work instead of one command | Primer: Always have a tested rollback procedure before making any production change |
| 3 | Testing in Production | A configuration error that staging would have caught causes a 2-hour production outage | Primer: Always test in staging first; imperfect testing is better than no testing |
| 4 | Single Point of Knowledge | That person is unavailable during an incident; the remaining team takes 5x longer to resolve it | Primer: Document everything; cross-train team members; no single points of failure in knowledge |

Damage Report

  • Downtime: 2-4 hours of degraded or unavailable service
  • Data loss: Potential, depending on the failure mode and backup state
  • Customer impact: Visible errors, degraded performance, or complete outage for affected users
  • Engineering time to remediate: 8-16 engineer-hours across incident response and follow-up
  • Reputation cost: Internal trust erosion; possible external customer-facing apology

What the Primer Teaches

  • Footgun #1: The primer's section on skipping the documentation would have taught the engineer to read the primer and runbooks before making changes; past lessons are documented for a reason.
  • Footgun #2: The primer's section on rollback plans would have taught the engineer to always have a tested rollback procedure before making any production change.
  • Footgun #3: The primer's section on testing in production would have taught the engineer to always test in staging first; imperfect testing is better than no testing.
  • Footgun #4: The primer's section on single points of knowledge would have taught the engineer to document everything and cross-train team members, so that knowledge has no single point of failure.

Cross-References