Building predictive drive failure detection with S.M.A.R.T. data
How we process S.M.A.R.T. attributes across thousands of drives to surface actionable warnings before a failure impacts cluster health.
NodeFoundry polls S.M.A.R.T. attributes from every drive in your cluster every 15 minutes. Here’s what we do with that data.
What S.M.A.R.T. actually tells you
Modern drives expose dozens of attributes, many of them vendor-specific. A handful are useful signals on almost every model:
- Reallocated Sector Count: sectors the drive has remapped to spare blocks after read or write errors. Any non-zero trend is worth watching.
- Uncorrectable Error Count: errors the drive could not recover with its internal error correction.
- Wear Leveling Count (SATA SSDs) or Percentage Used (NVMe): tracks how much of the drive's rated write endurance has been consumed.
- CRC Error Count: usually indicates a cable or controller problem, not the drive itself.
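As a rough illustration of picking these attributes out of a raw report, here is a minimal sketch. It assumes the report is the parsed output of `smartctl --json -A <device>` for an ATA drive, and that the attribute IDs in the mapping match your drive model (IDs and names are vendor-dependent). The function name `extract_watched` and the metric names are hypothetical and are not NodeFoundry's collector.

```python
# Attribute IDs commonly seen on ATA drives. IDs and names vary by vendor,
# so this mapping is an assumption, not a standard.
WATCHED_ATA_ATTRIBUTES = {
    5: "reallocated_sector_count",
    187: "uncorrectable_error_count",
    177: "wear_leveling_count",
    199: "crc_error_count",
}


def extract_watched(report: dict) -> dict:
    """Return {metric_name: raw_value} for watched attributes in one report.

    `report` is assumed to follow the layout of `smartctl --json -A`,
    where ATA attributes live under ata_smart_attributes.table.
    """
    table = report.get("ata_smart_attributes", {}).get("table", [])
    found = {}
    for row in table:
        name = WATCHED_ATA_ATTRIBUTES.get(row.get("id"))
        if name is not None:
            found[name] = row["raw"]["value"]
    return found


# Trimmed-down sample report with made-up values, for illustration only.
sample = {
    "ata_smart_attributes": {
        "table": [
            {"id": 5, "name": "Reallocated_Sector_Ct", "raw": {"value": 12}},
            {"id": 9, "name": "Power_On_Hours", "raw": {"value": 20411}},
            {"id": 199, "name": "UDMA_CRC_Error_Count", "raw": {"value": 0}},
        ]
    }
}

print(extract_watched(sample))
# {'reallocated_sector_count': 12, 'crc_error_count': 0}
```

Attributes the drive does not report are simply absent from the result, which keeps a missing attribute distinct from one that reads zero.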
How we surface warnings
Rather than alerting on raw attribute values (which vary wildly by drive model), we alert on rate of change over a rolling 30-day window. A reallocated sector count of 12 is less interesting than a reallocated sector count that went from 0 to 12 in a week.
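The difference between an absolute value and a trend can be sketched in a few lines. This is an illustrative example, not NodeFoundry's implementation: the function names, the `min_increase` threshold, and the sample data are assumptions, and it treats the attribute as a counter that only increases.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=30)


def change_in_window(samples, now, window=WINDOW):
    """Return how much a counter grew across the trailing window.

    `samples` is a list of (timestamp, raw_value) tuples sorted by time.
    Returns 0 when fewer than two samples fall inside the window.
    """
    recent = [(ts, value) for ts, value in samples if now - window <= ts <= now]
    if len(recent) < 2:
        return 0
    return recent[-1][1] - recent[0][1]


def is_degrading(samples, now, min_increase=1):
    """Flag a drive whose counter grew by at least `min_increase` in the window."""
    return change_in_window(samples, now) >= min_increase


now = datetime(2024, 6, 30)

# Both drives currently report 12 reallocated sectors.
stable = [(now - timedelta(days=d), 12) for d in (28, 21, 14, 7, 0)]
rising = [
    (now - timedelta(days=7), 0),
    (now - timedelta(days=3), 5),
    (now, 12),
]

print(change_in_window(stable, now), is_degrading(stable, now))  # 0 False
print(change_in_window(rising, now), is_degrading(rising, now))  # 12 True
```

Both drives show the same raw value, but only the second one would raise a warning under this scheme.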
When NodeFoundry detects a degrading trend, it:
- Creates a drive_health_warn alert in your cluster timeline
- Links directly to the affected OSD so you can reweight it before replacing the drive
- Logs the full S.M.A.R.T. dump to the node’s audit trail
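To make the shape of such a warning concrete, here is a hypothetical sketch of an alert record. Only the alert name `drive_health_warn` comes from the product; every field name, the `DriveHealthAlert` class, and the sample values are assumptions for illustration. The reweight helper assumes a Ceph cluster, where `ceph osd reweight <osd-id> <weight>` accepts a weight between 0.0 and 1.0.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class DriveHealthAlert:
    """Hypothetical alert payload; field names are illustrative only."""

    node: str
    device: str
    osd_id: int
    attribute: str
    increase: int
    window_days: int = 30
    kind: str = "drive_health_warn"
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def reweight_command(self, weight: float = 0.5) -> list:
        """Build the Ceph CLI call an operator might run to shift data away."""
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weight must be between 0.0 and 1.0")
        return ["ceph", "osd", "reweight", str(self.osd_id), str(weight)]


alert = DriveHealthAlert(
    node="node-07",
    device="/dev/sdc",
    osd_id=14,
    attribute="reallocated_sector_count",
    increase=12,
)

print(json.dumps(asdict(alert), indent=2))
print(" ".join(alert.reweight_command(0.5)))  # ceph osd reweight 14 0.5
```

The command is only built here, not executed, so an operator can review it before lowering the weight on a production OSD.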
Limitations
S.M.A.R.T. is not a crystal ball. Drives can fail without any warning, especially SSDs, which can fail suddenly at end-of-life. We recommend treating a S.M.A.R.T. warning as a reason to schedule replacement promptly, and not treating the absence of warnings as proof that a drive is healthy.