Postmortem: SSD Firmware Bug Causes Silent Bit Corruption¶
| Field | Value |
|---|---|
| ID | PM-019 |
| Date | 2025-09-14 |
| Severity | SEV-3 |
| Duration | 3h (detection to confirmed scope) |
| Time to Detect | 0m (detected by scheduled scrub, not real-time alert) |
| Time to Mitigate | 3h (scope confirmed; affected drives quarantined and rebuild initiated) |
| Customer Impact | None — corrupted data was cold archive/backup data not serving live traffic |
| Revenue Impact | None |
| Teams Involved | Datacenter Operations, Storage Engineering, Infra SRE, Backup & Recovery |
| Postmortem Author | Leonard Achebe |
| Postmortem Date | 2025-09-17 |
Executive Summary¶
On 2025-09-14, the weekly scheduled ZFS scrub on the arc-pool-01 through arc-pool-24 storage nodes detected checksum mismatches on 3 of 24 nodes — all three hosting Intel DC S4610 SSDs with firmware revision 007. Investigation confirmed that the corruption matched a known Intel firmware advisory (SA-00489) describing silent bit-flip corruption on drives exceeding approximately 40,000 power-on hours. The RAID controller on all three nodes reported no errors because it performs only parity checks, not end-to-end data integrity verification. The corrupted data was exclusively cold archive and backup data; no hot data serving live traffic was affected. Affected drives were quarantined, firmware updates were applied to all remaining SA-00489-affected drives, and ZFS resilver operations were initiated to rebuild the degraded vdevs from healthy mirrors. No backup data was found to be unrestorable from alternate copies.
Timeline (All times UTC)¶
| Time | Event |
|---|---|
| 02:00 | Weekly ZFS scrub job starts automatically on all 24 storage nodes (arc-pool-01 through arc-pool-24) via cron |
| 04:17 | ZFS scrub completes on arc-pool-07; zpool status reports 1 checksum error on vdev sda — flagged in scrub completion log |
| 04:31 | ZFS scrub completes on arc-pool-12; zpool status reports 3 checksum errors on vdev sdb |
| 05:02 | ZFS scrub completes on arc-pool-19; zpool status reports 7 checksum errors on vdev sda |
| 06:00 | Automated scrub report aggregation job runs; generates summary report sent to storage-ops@meridian.internal DL |
| 06:14 | Storage engineer Wanjiru Kamau reads the scrub report during morning triage; notes 3 nodes with checksum errors — unusual; baseline is 0 errors per week |
| 06:22 | Wanjiru SSHes to arc-pool-07; runs zpool status -v to inspect error details; confirms checksum errors on drive serial S4610-2307-A8F4 |
| 06:28 | Wanjiru checks drive model and firmware: smartctl -i /dev/sda — Intel DC S4610, firmware 007, power-on hours: 41,204 |
| 06:35 | Wanjiru searches internal hardware advisory tracker for "S4610 firmware 007"; finds no entry — searches public Intel advisories |
| 06:41 | Wanjiru locates Intel SA-00489 (published 2025-07-10): describes silent data corruption on DC S4610 drives with firmware revision XCV10070 (internal: 007) after approximately 40,000 power-on hours |
| 06:44 | Wanjiru pages Leonard Achebe (Storage Engineering lead) and Infra SRE on-call Brenda Okafor |
| 06:55 | Leonard joins; confirms the advisory match; initiates scope assessment: how many drives in the fleet match model + firmware + power-on hours criteria? |
| 07:10 | Fleet scan script runs across all storage nodes: 31 of 288 drives are Intel DC S4610 with firmware 007; of those, 14 have power-on hours > 40,000 |
| 07:25 | Leonard and Brenda determine which pools and datasets are hosted on affected drives; cross-reference with data classification inventory |
| 07:48 | Confirmed: all 14 at-risk drives host cold-tier data only (backup snapshots, log archives, compliance exports); no hot-tier or customer-serving data on affected drives |
| 08:12 | Decision made: quarantine the 3 drives with confirmed errors immediately; schedule controlled firmware update + drive replacement for remaining 11 at-risk drives within 48 hours |
| 08:20 | arc-pool-07, arc-pool-12, arc-pool-19 taken offline for quarantine; ZFS resilver initiated from mirror copies on healthy drives |
| 08:45 | Backup & Recovery team (Ignacio Peralta) verifies that all data on corrupted vdevs has an intact alternate copy; no unrestorable data identified |
| 09:00 | Resilver in progress on all 3 pools; estimated completion 6–8 hours; no data at risk (mirror copies intact) |
| 09:05 | Scope confirmed; incident declared contained; action items scoped for firmware remediation of remaining 11 drives |
| 09:10 | Postmortem scheduled for 2025-09-17 |
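The 07:10 fleet scan boils down to extracting model, firmware, and power-on hours per drive and testing them against the advisory criteria. A minimal sketch of that parse is below; the sample output format and field names are illustrative (real `smartctl` output varies by version and drive family, and power-on hours come from the attribute table rather than `-i` on many drives).

```python
import re

# Illustrative smartctl-style output for one drive; real field names
# and layout differ across smartctl versions and drive families.
SAMPLE = """\
Device Model:     Intel DC S4610
Serial Number:    S4610-2307-A8F4
Firmware Version: 007
Power_On_Hours:   41204
"""

def parse_drive(text):
    """Extract the three fields the SA-00489 match needs."""
    fields = {}
    for pattern, key in [
        (r"Device Model:\s+(.+)", "model"),
        (r"Firmware Version:\s+(\S+)", "firmware"),
        (r"Power_On_Hours:\s+(\d+)", "power_on_hours"),
    ]:
        m = re.search(pattern, text)
        if m:
            fields[key] = m.group(1).strip()
    fields["power_on_hours"] = int(fields["power_on_hours"])
    return fields

def at_risk(drive):
    """SA-00489 criteria: S4610 family, firmware 007, >40,000 POH."""
    return ("S4610" in drive["model"]
            and drive["firmware"] == "007"
            and drive["power_on_hours"] > 40_000)
```

Run across the fleet, this is the filter that narrowed 288 drives to 31 firmware matches and 14 power-on-hour matches.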
Impact¶
Customer Impact¶
None. All corrupted data was classified as cold-tier: backup snapshots and compliance log archives that are not served to any live application or customer-facing system. The corruption was detected by ZFS checksums during the scrub; no application attempted to read the corrupted blocks during the window between corruption onset and detection. Customer-facing services ran on separate hot-tier storage pools with different drive models.
Internal Impact¶
- 3 storage nodes (`arc-pool-07`, `arc-pool-12`, `arc-pool-19`) taken offline for resilver; cold-tier data on those pools inaccessible for 6–8 hours during resilver (no production dependency on cold-tier data for that window)
- Wanjiru Kamau: ~3 hours of investigation and scope assessment
- Leonard Achebe: ~4 hours of incident lead, scope determination, and remediation planning
- Brenda Okafor (Infra SRE): ~2.5 hours
- Ignacio Peralta (Backup & Recovery): ~1.5 hours for backup integrity verification
- Firmware update and drive replacement program for 11 remaining at-risk drives: estimated 3–4 days of Datacenter Operations work
- Hardware advisory monitoring process review: 2-day effort scheduled for the Hardware team
Data Impact¶
3 drives had confirmed ZFS checksum errors. ZFS reports the checksum mismatch and refuses to serve the corrupted blocks — it does not silently return corrupted data to the application. Because all affected vdevs had intact mirror copies, ZFS was able to self-heal (correct the corrupted blocks from the mirror) during the scrub itself. Post-scrub verification confirmed zero persistent data loss. The 7 corrupted blocks on arc-pool-19 (the worst-affected node) were corrected from mirror and the drive was subsequently quarantined.
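The per-device checksum counts above come from the `CKSUM` column of `zpool status` output. A hedged sketch of how the 06:00 aggregation job could extract them — a simple column parse, not the actual report tooling — looks like this:

```python
def cksum_errors(zpool_status_text):
    """Return {device: cksum_count} for devices with nonzero CKSUM,
    parsed from the config table of `zpool status` output
    (columns: NAME STATE READ WRITE CKSUM)."""
    errors = {}
    for line in zpool_status_text.splitlines():
        parts = line.split()
        # Config-table rows have exactly 5 columns and a ZFS state word.
        if len(parts) == 5 and parts[1] in {"ONLINE", "DEGRADED", "FAULTED"}:
            name, _state, _read, _write, cksum = parts
            if cksum.isdigit() and int(cksum) > 0:
                errors[name] = int(cksum)
    return errors

# Abbreviated sample modeled on arc-pool-19's scrub result.
SAMPLE = """\
  pool: arc-pool-19
 state: ONLINE
config:
        NAME        STATE     READ WRITE CKSUM
        arc-pool-19 ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sda     ONLINE       0     0     7
            sdb     ONLINE       0     0     0
"""
```

Feeding the sample through `cksum_errors` surfaces only `sda` with its 7 checksum errors, matching what the scrub report flagged.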
Root Cause¶
What Happened (Technical)¶
Intel DC S4610 SSDs with internal firmware revision XCV10070 (reported as 007 by smartctl) contain a firmware defect in the wear-leveling algorithm that, after approximately 40,000 power-on hours, can cause write operations to silently flip individual bits in data that has not been recently accessed (cold data). The corruption is silent from the perspective of the storage controller: the RAID controller that manages the physical drives performs only RAID parity checks (which verify inter-drive consistency) and does not independently verify the integrity of data returned by each drive. When the drive returns a bit-flipped block, the RAID controller accepts it as valid.
Intel published security advisory SA-00489 on 2025-07-10, approximately two months before this incident, recommending firmware updates to all affected drives and an immediate audit of drives exceeding 40,000 power-on hours. The advisory was not ingested into Meridian Technologies' internal hardware advisory tracking system because there was no automated feed from Intel's advisory RSS/API into the tracker — advisories were manually reviewed by a hardware engineer on an ad-hoc basis, and this one was missed.
ZFS detected the corruption because it stores a checksum for every block written (SHA-256, as configured for data blocks on these pools; ZFS's stock default is the non-cryptographic fletcher4). When the scrub read the corrupted blocks, ZFS computed the checksum, compared it against the stored checksum, found a mismatch, and flagged the error. ZFS then read the corresponding blocks from the mirror copy, verified their checksum, confirmed them good, and rewrote the corrected blocks to the corrupted drive — all without any data being lost or served incorrectly to an application.
The corruption affected only cold data because the firmware defect manifests in the wear-leveling code that operates on blocks that have not been recently rewritten. Hot data (frequently accessed and overwritten) does not trigger the defect because the wear-leveling cycle that contains the bug is only reached for blocks that have been stable for extended periods.
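The detect-and-heal logic above can be reduced to a few lines. This is a conceptual sketch, not ZFS's implementation: a stored per-block checksum catches a single flipped bit that parity-only checks on one drive's read cannot see.

```python
import hashlib

def checksum(block: bytes) -> bytes:
    # Stand-in for the per-block checksum ZFS keeps in the block
    # pointer, separate from the data itself.
    return hashlib.sha256(block).digest()

# Write path: the block and its checksum are both recorded.
block = b"cold archive data, untouched for months"
stored = checksum(block)

# Silent firmware bit flip: the drive returns the block with one bit
# changed and reports no error.
corrupted = bytes([block[0] ^ 0x01]) + block[1:]

# Read/scrub path: recompute and compare. On mismatch, do not serve
# the block; read the mirror copy, verify it, and repair the bad drive.
assert checksum(corrupted) != stored   # corruption detected
assert checksum(block) == stored       # mirror copy verifies clean
```

A RAID parity check would only notice if the drives *disagreed on a parity computation*; here the drive confidently returns wrong data, which only an independent stored checksum can refute.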
Contributing Factors¶
- **No automated hardware firmware advisory monitoring:** Intel SA-00489 was published in July 2025. Meridian's hardware team had no automated mechanism to ingest, parse, or alert on vendor security advisories. The advisory was missed for two months. An automated feed that cross-referenced advisory affected models against fleet inventory would have flagged the 31 at-risk drives within days of publication.
- **RAID controller's "all clear" provides false confidence:** The RAID controllers on these nodes (LSI MegaRAID SAS 9361-8i) report drive health based on S.M.A.R.T. data and RAID parity. Neither mechanism detects intra-drive silent bit corruption. Engineers and automated health checks trusted the controller's "all drives healthy" report without understanding that it does not provide end-to-end data integrity guarantees.
- **No automated firmware compliance checking:** The fleet had no tooling to continuously assert that drives of a given model are running approved firmware versions. A simple inventory check that flags `model=S4610 AND firmware=007 AND power_on_hours>40000` against a known-bad list would have surfaced these drives immediately after the advisory was ingested.
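The compliance check described in the last factor is a join between the fleet inventory and a known-bad list derived from ingested advisories. A minimal sketch, with an illustrative schema (the field names, serials, and advisory entry here are assumptions, not the actual tooling):

```python
# Known-bad entries derived from ingested advisories (illustrative schema).
KNOWN_BAD = [
    {"model": "S4610", "firmware": "007",
     "min_power_on_hours": 40_000, "advisory": "SA-00489"},
]

def compliance_violations(fleet):
    """Flag every drive matching a known-bad (model, firmware, POH) entry."""
    hits = []
    for drive in fleet:
        for bad in KNOWN_BAD:
            if (bad["model"] in drive["model"]
                    and drive["firmware"] == bad["firmware"]
                    and drive["power_on_hours"] > bad["min_power_on_hours"]):
                hits.append((drive["serial"], bad["advisory"]))
    return hits

# Two hypothetical inventory records; only the first exceeds the
# 40,000-hour threshold, so only it is flagged.
fleet = [
    {"serial": "S4610-2307-A8F4", "model": "Intel DC S4610",
     "firmware": "007", "power_on_hours": 41_204},
    {"serial": "S4610-2401-B2C1", "model": "Intel DC S4610",
     "firmware": "007", "power_on_hours": 12_950},
]
```

Run daily against the full inventory (AI-019-01/AI-019-03 territory), a check like this would have surfaced the at-risk drives within a day of the advisory being ingested.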
What We Got Lucky About¶
- **ZFS checksums detected the corruption that the RAID controller could not see.** If these volumes had been formatted with ext4 or XFS (which do not perform block-level checksumming by default), the corruption would have been completely invisible — neither to the storage stack nor to applications that read the data. The corrupted bytes would have been returned as valid data, potentially poisoning backups and archives silently until a restore attempt revealed the problem, at which point the data might have been long overwritten.
- **The corrupted data was exclusively on cold-tier drives.** The same Intel DC S4610 drives were not used in the hot-tier storage pools (those use a different model). Had the hot-tier pools been affected, application data serving live traffic could have been corrupted.
- **All affected vdevs had an intact mirror.** The pools used mirrored vdevs, so every block had at least one redundant copy, and in this case the mirrors were on drives from a different production batch (different manufacturing date, lower power-on hour count) that had not yet crossed the 40,000-hour threshold. A correlated failure (both sides of a mirror crossing 40K hours simultaneously) would have resulted in unrecoverable data loss.
Detection¶
How We Detected¶
Detection was via the weekly scheduled ZFS scrub, which is the standard ZFS maintenance tool for proactively identifying checksum errors before they accumulate into data loss. The scrub runs every Sunday at 02:00 UTC. The scrub completion report is aggregated and emailed to the storage ops distribution list. Wanjiru identified the anomaly during morning triage at 06:14 UTC, approximately 4 hours after the first error was flagged.
Why We Didn't Detect Sooner¶
Real-time detection of this class of error was not possible — ZFS checksum errors are detected only when blocks are read (either during a scrub or during a normal read). The weekly scrub frequency means that corruption that occurred between scrubs could sit undetected for up to 7 days. The advisory was available 2 months before this incident; had it been ingested promptly, the drives would have been patched or quarantined before the corruption occurred.
Response¶
What Went Well¶
- Wanjiru's immediate cross-referencing of the drive model and firmware against Intel's advisory database was methodical and fast — root cause was identified in under 15 minutes of investigation.
- The fleet scan script (which Wanjiru ran to enumerate all affected drives) was already written and maintained by the Storage Engineering team as part of routine fleet inventory tooling — it required only a new filter criterion to be immediately useful.
- The data classification inventory (which maps storage pools to data tiers) was accurate and up-to-date, enabling a rapid and confident determination that no hot-tier data was at risk. This directly informed the decision to treat the incident as SEV-3 rather than escalating to SEV-2.
- ZFS's self-healing behavior (correcting corrupted blocks from mirror during scrub) meant that data integrity was restored automatically during the detection event — no manual data restoration was required.
What Went Poorly¶
- A two-month delay between advisory publication and detection is unacceptable for a critical firmware advisory. The hardware advisory monitoring process is entirely manual and reactive.
- The RAID controller's health status was being surfaced in automated health dashboards without any annotation that it does not guarantee end-to-end data integrity. Operations engineers reasonably trusted the "all drives healthy" indicator without understanding its limitations.
- The incident was not escalated to SEV-2 initially because the scope assessment (hot-tier vs. cold-tier impact) took 53 minutes to complete. A more rapid data classification lookup capability would have enabled faster severity determination.
Action Items¶
| ID | Action | Priority | Owner | Status | Due Date |
|---|---|---|---|---|---|
| AI-019-01 | Implement automated hardware firmware advisory feed: subscribe to Intel, Samsung, and Seagate advisory RSS/APIs; cross-reference affected models against fleet inventory daily; alert on matches | Critical | Leonard Achebe | Open | 2025-10-01 |
| AI-019-02 | Deploy firmware update to all 11 remaining Intel DC S4610 drives with firmware 007 and power-on hours > 35,000; schedule replacement for drives > 40,000 hours | Critical | Datacenter Operations | Open | 2025-09-21 |
| AI-019-03 | Add firmware compliance check to weekly fleet health report: flag any drive model + firmware combination that matches a known-advisory entry | High | Leonard Achebe | Open | 2025-10-08 |
| AI-019-04 | Update monitoring dashboards: annotate RAID controller health indicators with caveat "parity-only, no end-to-end integrity guarantee"; add separate ZFS checksum error rate metric to storage health dashboard | High | Brenda Okafor | Open | 2025-09-28 |
| AI-019-05 | Increase ZFS scrub frequency on high-risk drives (models with known wear-level advisories) from weekly to daily during the remediation window | Medium | Wanjiru Kamau | Open | 2025-09-20 |
| AI-019-06 | Evaluate increasing ZFS scrub frequency cluster-wide from weekly to every 3 days; assess performance impact on cold-tier workloads | Low | Leonard Achebe | Open | 2025-10-15 |
Lessons Learned¶
- **Filesystem-level checksumming is not redundant with RAID:** RAID controllers verify inter-drive consistency (parity) but do not detect intra-drive bit corruption. Only filesystems or applications that store and verify block-level checksums (ZFS, Btrfs, application-layer checksums) can catch the class of silent corruption that RAID misses. This distinction must be part of every storage architecture decision.
- **Vendor security advisories are threat intelligence for hardware:** Software security teams have mature tooling for tracking CVEs and patching vulnerable software. Hardware firmware advisories require the same discipline — automated ingestion, fleet cross-referencing, and SLA-driven remediation. Treating firmware advisories as optional reading is equivalent to ignoring software CVEs.
- **Cold data is not low-risk data:** Archives and backups are often given less operational attention than hot-tier data because they do not serve live traffic. However, the value of backup data is revealed only at restore time — when it is too late to discover that it was silently corrupted. Cold data integrity must be proactively verified, not assumed.
Cross-References¶
- Failure Pattern: Silent data corruption / firmware defect; advisory monitoring gap; RAID controller false confidence
- Topic Packs: Storage integrity, ZFS, RAID limitations, hardware lifecycle management, firmware advisory management
- Runbook: `runbooks/storage/zfs-scrub-error-response.md`
- Decision Tree: Triage → ZFS checksum errors in scrub report → `smartctl -i` for model/firmware/power-on hours → cross-reference against advisory tracker → scope assessment (hot vs. cold tier) → quarantine affected drives → initiate resilver from mirror → firmware update program for remaining at-risk drives