Building predictive drive failure detection with S.M.A.R.T. data

NodeFoundry polls S.M.A.R.T. attributes from every drive in your cluster every 15 minutes. Here’s what we do with that data.

What S.M.A.R.T. actually tells you

Modern drives expose dozens of attributes, and many are vendor-specific. A handful are broadly useful signals of impending trouble:

Reallocated Sector Count: sectors the drive has remapped to its spare area after read or write errors. Any increase over time is worth watching.
Uncorrectable Error Count: errors the drive couldn’t fix internally.
Wear Leveling Count (SATA SSDs) or Percentage Used (NVMe): indicates how much of the rated write endurance has been consumed.
CRC Error Count: usually indicates a cable or controller problem, not the drive itself.
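
As a rough illustration, the attributes above can be picked out of the JSON report that smartmontools produces with `smartctl --json -A <device>`. This is a sketch, not NodeFoundry's collector: the function name `extract_key_attributes` and the label names are made up for this example, and the ATA attribute IDs (5, 187, 177, 199) are the conventional ones, which individual vendors may not follow.

```python
# Conventional ATA attribute IDs. Vendors can deviate, so treat this map as an assumption.
ATA_IDS = {
    5: "reallocated_sectors",
    187: "uncorrectable_errors",
    177: "wear_leveling_count",
    199: "crc_errors",
}


def extract_key_attributes(report: dict) -> dict:
    """Pick the attributes discussed above out of a parsed smartctl JSON report."""
    found = {}
    # ATA drives: attributes arrive as a table of rows keyed by numeric ID.
    for row in report.get("ata_smart_attributes", {}).get("table", []):
        label = ATA_IDS.get(row.get("id"))
        if label is not None:
            found[label] = row["raw"]["value"]
    # NVMe drives: a flat health log replaces the attribute table.
    nvme = report.get("nvme_smart_health_information_log")
    if nvme is not None:
        found["percentage_used"] = nvme.get("percentage_used")
        found["media_errors"] = nvme.get("media_errors")
    return found


# Hand-written sample shaped like smartctl's JSON output (values are invented).
sample_ata = {
    "ata_smart_attributes": {
        "table": [
            {"id": 5, "name": "Reallocated_Sector_Ct", "raw": {"value": 12}},
            {"id": 194, "name": "Temperature_Celsius", "raw": {"value": 31}},
            {"id": 199, "name": "UDMA_CRC_Error_Count", "raw": {"value": 0}},
        ]
    }
}
print(extract_key_attributes(sample_ata))
```

Attributes that are not in the map, such as temperature in the sample, are simply ignored, and a drive that reports neither section yields an empty result rather than an error.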

How we surface warnings

Rather than alerting on raw attribute values (which vary wildly by drive model), we alert on rate of change over a rolling 30-day window. A reallocated sector count of 12 is less interesting than a reallocated sector count that went from 0 to 12 in a week.
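
The difference between a static value and a trend can be sketched in a few lines. This is a minimal illustration of the idea, not NodeFoundry's actual detection logic: the function names, the sample format, and the threshold of 5 are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def windowed_increase(samples, now, window=timedelta(days=30)):
    """Return how much a counter grew inside the window ending at `now`.

    `samples` is a list of (timestamp, raw_value) pairs sorted by time.
    """
    recent = [(t, v) for t, v in samples if now - window <= t <= now]
    if len(recent) < 2:
        return 0  # not enough history inside the window to call it a trend
    return recent[-1][1] - recent[0][1]


def is_degrading(samples, now, threshold=5):
    # `threshold` is an illustrative value, not a NodeFoundry default.
    return windowed_increase(samples, now) >= threshold


now = datetime(2024, 6, 30)

# A drive that has sat at 12 reallocated sectors for the whole window.
stable = [(now - timedelta(days=d), 12) for d in (28, 14, 0)]

# A drive that went from 0 to 12 reallocated sectors in a week.
growing = [
    (now - timedelta(days=7), 0),
    (now - timedelta(days=3), 5),
    (now, 12),
]

print(is_degrading(stable, now), is_degrading(growing, now))
```

Both drives report the same raw value today, but only the second one shows growth inside the window, so only it would be flagged.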

When NodeFoundry detects a degrading trend, it:

Creates a drive_health_warn alert in your cluster timeline
Links directly to the affected OSD so you can reweight it before replacing the drive
Logs the full S.M.A.R.T. dump to the node’s audit trail

Limitations

S.M.A.R.T. is not a crystal ball. Drives can fail with no warning in their attributes, especially SSDs, which can fail suddenly at end-of-life. We recommend treating a S.M.A.R.T. warning as a reason to schedule replacement, not as a reason to defer it.

We're happy to walk you through it.