The Cost of No Staging¶
Category: The Hard Lesson | Domains: environments, ci-cd | Read time: ~5 min
Setting the Scene¶
We were a 30-person startup. Scrappy, fast, deploying to production six times a day. We had two environments: local dev (your laptop) and production. "We test in prod" was said half-ironically, half-seriously. Staging had been on the roadmap since Series A. We were now Series C. The roadmap had moved, but the staging line item hadn't.
Our stack was a Node.js API backed by MySQL 8 on RDS, serving a React frontend to about 200,000 monthly active users.
What Happened¶
A product requirement came in to restructure the user profile schema. The old table had grown to 47 columns — a classic "add one more field" situation. The new design normalized it into three tables: users, user_profiles, and user_preferences. Clean, proper, overdue.
The migration was written in Knex.js. The developer tested it locally against a MySQL instance with 500 seed records. It ran in 1.2 seconds. Looked great in code review. The rollback was included. We merged it.
Our deploy pipeline ran knex migrate:latest as part of the deploy step. At 2:15 PM on a Monday, the migration started running against production — a users table with 847,000 rows.
The first ALTER TABLE users DROP COLUMN bio requested an exclusive metadata lock. In MySQL 8, that DDL has to wait for every open transaction touching the table to finish before it can acquire the lock, and every query that arrives while it waits queues behind it, blocking reads and writes alike. The API started queuing requests. Response times went from 45ms to 30 seconds within a minute.
Then the migration hit the CREATE TABLE user_profiles AS SELECT ... FROM users. On 847,000 rows, this took 4 minutes, and for those 4 minutes the users table stayed locked. The connection pool was exhausted, and the API returned 503 to every request.
But the real disaster was step three: ALTER TABLE users DROP COLUMN preferences_json. That column held the data the INSERT INTO user_preferences step was supposed to copy first — the migration steps were in the wrong order. Locally, with 500 rows, the timing didn't matter because everything finished in about a second. In production, the copy hadn't finished when the drop executed.
We lost the preferences_json column for 847,000 users. Notification settings, language preferences, dashboard layouts — gone.
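The failure mode is easy to reproduce in miniature. Here's a toy in-memory model in plain JavaScript (no database; names illustrative, not the actual migration code) showing why a destructive drop that overtakes an in-flight copy loses data:

```javascript
// Toy model of the incident. Plain objects stand in for MySQL rows;
// copiedBeforeDrop models how far the copy got before the drop ran.
function runMigration(users, copiedBeforeDrop) {
  const userPreferences = [];

  // Step 2 (intended): INSERT INTO user_preferences ... SELECT FROM users.
  // In production this was still running when step 3 executed, so only
  // the first `copiedBeforeDrop` rows made it across.
  for (let i = 0; i < copiedBeforeDrop; i++) {
    userPreferences.push({ userId: users[i].id, prefs: users[i].preferencesJson });
  }

  // Step 3: ALTER TABLE users DROP COLUMN preferences_json.
  // The source data is destroyed no matter how far the copy got.
  for (const u of users) delete u.preferencesJson;

  return userPreferences;
}

// Ten users; the copy is overtaken after four rows.
const users = Array.from({ length: 10 }, (_, i) => ({
  id: i,
  preferencesJson: JSON.stringify({ lang: 'en', theme: i % 2 ? 'dark' : 'light' }),
}));
const copied = runMigration(users, 4);

console.log(copied.length); // 4: only the copied rows survive
console.log(users.some((u) => 'preferencesJson' in u)); // false: source is gone
```

At 500 rows the copy always wins the race; at 847,000 it lost, which is exactly why only a production-sized dataset could have surfaced this.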
The Moment of Truth¶
I ran SHOW COLUMNS FROM users and saw that preferences_json was gone. Then I checked the user_preferences table: 312,000 rows. We'd copied 37% of the data before the column was dropped. The other 535,000 users had their preferences reset to defaults. The rollback script couldn't restore a dropped column's data.
The Aftermath¶
We restored from an RDS snapshot taken 6 hours earlier; we couldn't safely replay the binary log past the migration, so we lost 6 hours of signups and profile updates. Then we built a staging environment. It took 2 weeks and cost $1,200/month in AWS spend. The incident had cost an estimated $90,000 in engineering time, customer support, and a credits program for affected users.
We also added a rule: no DDL migration runs without first executing against a staging database with a production-sized dataset. We created a nightly mysqldump pipeline that populated staging with anonymized production data.
The Lessons¶
- Staging environments aren't luxury: They're the minimum viable infrastructure for any team that touches a database in production. If you can't afford staging, you can't afford the incident that not having staging will cause.
- The cost of staging is less than one production incident: $1,200/month for staging vs. $90,000 for one bad migration. We could have run staging for 6 years for the cost of that afternoon.
- Test destructive changes on production-sized data: A migration that runs in 1 second on 500 rows and 4 minutes on 847,000 rows is not the same migration. Volume changes behavior.
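The ordering lesson translates directly into migration code: the destructive step goes last, gated on a verified copy. A sketch in Knex.js style, reusing the incident's table and column names (the count check is my addition, not the team's actual migration):

```javascript
// Sketch only: assumes Knex.js and the incident's table/column names.
// The point is ordering: the destructive DROP runs last, and only
// after the copy is verified.
async function up(knex) {
  // 1. Create the destination table first.
  await knex.schema.createTable('user_preferences', (t) => {
    t.integer('user_id').unsigned().primary();
    t.json('preferences_json');
  });

  // 2. Copy while the source column still exists.
  await knex.raw(
    `INSERT INTO user_preferences (user_id, preferences_json)
     SELECT id, preferences_json FROM users`
  );

  // 3. Verify the copy before destroying anything.
  const [[{ n: src }]] = await knex.raw('SELECT COUNT(*) AS n FROM users');
  const [[{ n: dst }]] = await knex.raw('SELECT COUNT(*) AS n FROM user_preferences');
  if (src !== dst) throw new Error(`copy incomplete: ${dst}/${src} rows`);

  // 4. Only now drop the source column.
  await knex.schema.alterTable('users', (t) => t.dropColumn('preferences_json'));
}

// In a real migration file: module.exports = { up, down };
```

Even this ordering still pays the metadata-lock cost of the final ALTER on a large table; on a hot production database an online schema-change tool would be the next step up.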
What I'd Do Differently¶
I'd set up staging before writing the first line of application code. Even a minimal environment — same schema, synthetic data, no HA — costs almost nothing and catches almost everything. I'd also add a CI step that runs every migration against a database seeded with at least 100K rows and fails if it takes longer than 30 seconds or acquires a table-level lock.
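That CI gate could be sketched as a small wrapper around Knex's programmatic migration runner (the function name and timeout plumbing are illustrative; the table-lock check mentioned above would need performance_schema queries and is omitted here):

```javascript
// Sketch of the CI budget check. Assumes a knex instance pointed at a
// CI database pre-seeded with ~100K rows.
async function migrateWithBudget(knex, timeoutMs = 30_000) {
  const started = Date.now();
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`migration exceeded ${timeoutMs}ms budget`)),
      timeoutMs
    );
  });
  try {
    // knex.migrate.latest() is Knex's programmatic migration runner.
    await Promise.race([knex.migrate.latest(), deadline]);
  } finally {
    clearTimeout(timer); // don't leave the CI process hanging after success
  }
  return Date.now() - started;
}
```

In CI, a rejection here exits nonzero and fails the build, so a migration that behaves differently at 100K rows than at 500 never reaches the deploy step.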
The Quote¶
"We saved $1,200 a month on staging and spent $90,000 on a Monday afternoon."
Cross-References¶
- Topic Packs: CI/CD Pipelines & Patterns, Environment Variables
- Case Studies: Deployment Stuck ImagePull Vault