The Test We Never Wrote

Category: The Hard Lesson · Domains: ci-cd, testing · Read time: ~5 min


Setting the Scene

It was a Rails monolith serving about 40,000 requests per minute. We had 1,200 unit tests, all green, all fast. The team was proud of that coverage number — 87%. We'd talk about it in sprint reviews like it meant something.

What we didn't have was a single integration test that validated the contract between our API gateway and the downstream payment service. "The unit tests cover the serialization logic," I'd said in the PR review. "We'll add integration tests next sprint." That was eleven sprints ago.

What Happened

We shipped a refactor of the payment request builder on a Thursday afternoon. The PR looked clean — renamed some fields to match our new naming conventions, updated the unit tests, all green. CI passed in 4 minutes. We merged and deployed to production via our Jenkins pipeline with zero hesitation.

The unit tests validated that the PaymentRequest object serialized correctly. What they didn't validate was that the downstream payment processor actually accepted that serialization. We'd renamed card_number to cardNumber in our model, updated our serializer tests to expect cardNumber, and felt good about ourselves.

The payment service expected card_number. It had always expected card_number. Our integration with it was documented in a Confluence page last updated in 2019.
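A contract check this small would have caught the rename before it shipped. Here is a minimal sketch of the idea, assuming a simplified stand-in for the serializer; the `PaymentRequest` shape and field list are illustrative, not our actual code:

```ruby
require "json"

# Hypothetical stand-in for the refactored serializer: the Ruby-side
# attribute was renamed, and the JSON key silently followed it.
PaymentRequest = Struct.new(:cardNumber, :amount) do
  def to_json(*)
    { cardNumber: cardNumber, amount: amount }.to_json
  end
end

# The keys the payment service actually enforces. In a real suite this
# would live in a checked-in contract fixture, not inline.
REQUIRED_KEYS = %w[card_number amount].freeze

# Returns the required keys that are missing from the serialized payload.
def contract_violations(request)
  keys = JSON.parse(request.to_json).keys
  REQUIRED_KEYS - keys
end

violations = contract_violations(PaymentRequest.new("4242424242424242", 1999))
puts violations.inspect  # => ["card_number"] -- the rename surfaces as a missing key
```

A test like this fails the moment the wire format drifts from what the downstream expects, which is exactly the gap our unit tests could not see.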

For 47 minutes, every payment failed silently. The payment service returned a 422, our error handler logged it as a "transient failure," and the retry logic dutifully retried the same malformed request three more times. Our monitoring dashboard showed a spike in payment retries, but we'd been ignoring that metric for weeks because of an unrelated flaky upstream dependency.
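The retry bug is its own lesson: a 422 means the server rejected the request itself, so resending the identical payload can never succeed. A sketch of the fix, with an illustrative module and status lists rather than our actual classifier:

```ruby
# Minimal sketch of a retry classifier that treats client errors as
# permanent. Only genuinely transient statuses are worth retrying;
# a 4xx like our 422 should fail fast and alert instead.
# (RetryDecision and the status list are hypothetical names.)
module RetryDecision
  TRANSIENT_STATUSES = [429, 500, 502, 503, 504].freeze

  def self.retryable?(status)
    TRANSIENT_STATUSES.include?(status)
  end
end

puts RetryDecision.retryable?(422)  # false: malformed request, page a human
puts RetryDecision.retryable?(503)  # true: transient upstream failure
```

Had the handler made this distinction, the first 422 would have paged someone instead of being buried under three doomed retries.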

A customer support agent Slacked the on-call at 5:23 PM: "Getting a lot of calls about failed orders." That's how we found out. Not from our alerting. Not from our dashboards. From a person answering a phone.

The Moment of Truth

I ran curl against the payment service with our new payload format and watched the 422 come back in real time. Then I ran it with the old field name and got a 200. The fix was a one-line change — literally renaming a field back. The deploy took 8 minutes. The damage took three weeks to unwind: refunds, customer apologies, a merchant trust review.

The Aftermath

We wrote integration tests. Not "next sprint" — that weekend. We added a contract test suite using Pact that validated every downstream API interaction. We also added a pre-deploy smoke test that hit the payment sandbox with a real payload. CI went from 4 minutes to 11 minutes. Nobody complained.
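The pre-deploy smoke test was roughly this shape: post one known-good payload to the sandbox and fail the pipeline on anything but a 2xx. This sketch injects the HTTP client so the logic is testable offline; the sandbox URL, payload, and client interface are all made up for illustration:

```ruby
# Sketch of the pre-deploy smoke test gate. In CI the injected client
# would perform a real HTTP POST to the payment sandbox; here it is a
# lambda so the check itself can run anywhere.
SANDBOX_PAYLOAD = { "card_number" => "4242424242424242", "amount" => 100 }.freeze

# Returns true only when the sandbox accepts the canonical payload.
def smoke_test(client)
  status = client.call("https://payments-sandbox.example.com/charges", SANDBOX_PAYLOAD)
  (200..299).cover?(status)
end

# Simulated sandbox enforcing the real contract -- the same 422 we hit
# in production if the required key is absent.
sandbox = ->(_url, payload) { payload.key?("card_number") ? 200 : 422 }
puts smoke_test(sandbox)  # => true
```

Wiring this as a required step before the deploy job means a contract break stops in CI, not in a customer's checkout flow.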

What didn't change fast enough was our culture around the phrase "the unit tests cover it." That took another incident, three months later, with the shipping provider API.

The Lessons

  1. Integration tests catch what unit tests can't: Unit tests validate your code in isolation. Integration tests validate that your code works with everything it touches. They're different tools for different failure modes.
  2. "We'll add tests later" means never: If it's not in the PR, it's not getting done. Treat missing integration tests as a blocking review comment.
  3. Test the integration points: Every external API call, every database query, every message queue interaction — these are the seams where production breaks.

What I'd Do Differently

I'd add a make contract-test target from day one that runs Pact or similar against every downstream dependency. I'd make it a required CI gate, not optional. I'd also set up a synthetic transaction monitor that executes a real end-to-end payment flow every 5 minutes in production and pages if it fails.
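The core of that synthetic monitor is a single end-to-end check that a scheduler runs every 5 minutes and pages on. A sketch, with the flow injected so it runs without a live API; `run_payment_flow` and the result symbols are hypothetical stand-ins for a real client driving a dedicated test account:

```ruby
# One iteration of the synthetic transaction monitor: execute a full
# payment flow against a test account and report pass/fail. A scheduler
# (cron, Sidekiq, etc.) would invoke this every 5 minutes and page when
# healthy: false. The injected flow stands in for the real API client.
def synthetic_check(run_payment_flow)
  result = run_payment_flow.call
  { healthy: result == :payment_confirmed, checked_at: Time.now.utc }
end

healthy_flow = -> { :payment_confirmed }
broken_flow  = -> { :payment_rejected }

puts synthetic_check(healthy_flow)[:healthy]  # => true
puts synthetic_check(broken_flow)[:healthy]   # => false
```

A probe like this would have surfaced the outage within minutes, instead of leaving discovery to a support agent answering phones.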

The Quote

"Our unit tests were a beautifully comprehensive map of a country that didn't exist."

Cross-References