Saturday, March 10, 2007

Lessons Learned: Google Report on Hard Drives

The other day, several Googlers released a white paper detailing their experience with hard drive failures in our datacenter machines. According to the report, there hadn't been a good study of hard disk lifespans in a really large population, so they collected this data from the large number of machines that Google has in service. The study reached several conclusions, some of them surprising:
  • There was no consistent pattern of disk failure associated with high temperature or increased disk activity.
  • Some SMART error signals are well-correlated with impending drive failure, including scan errors, reallocation errors, and probational counts; drives that reported a scan error were 39 times more likely to fail within 60 days.
  • However, other SMART error signals have only weak correlations with failure, namely seek errors and CRC errors; over 72% of all drives reported at least one seek error.
  • A majority of the failed drives (56%) reported none of the aforementioned well-correlated errors, and a large fraction (36%) reported no SMART errors whatsoever.
As someone who has had a few drives go bad over the years, I found this very interesting. My key takeaway: certain SMART error signals (not all) serve as a valuable warning, but you can't count on SMART to tell you when your drive is about to fail.
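
To make that takeaway concrete, here is a rough Python sketch of how I might sort a drive's SMART counts into the two groups from the report. The signal names are labels I made up for illustration, not real SMART attribute IDs, and the three result levels are my own assumption, not something the paper defines. Note that the last level is "no_signal" rather than "healthy", since a large fraction of failed drives reported no SMART errors at all.

```python
# Hypothetical labels for the signal groups described in the report.
# These are NOT real SMART attribute names or IDs.
STRONG_SIGNALS = {"scan_errors", "reallocation_errors", "probational_counts"}
WEAK_SIGNALS = {"seek_errors", "crc_errors"}


def assess_drive(counts):
    """Classify a dict of {signal_name: count} as warning, watch, or no_signal.

    Returns a (level, signals) tuple, where signals lists the nonzero
    signal names that triggered the level, sorted alphabetically.
    """
    # Any nonzero well-correlated signal is treated as a real warning.
    strong = sorted(name for name in STRONG_SIGNALS if counts.get(name, 0) > 0)
    if strong:
        return "warning", strong

    # Weakly correlated signals alone are only worth keeping an eye on.
    weak = sorted(name for name in WEAK_SIGNALS if counts.get(name, 0) > 0)
    if weak:
        return "watch", weak

    # No SMART errors is not proof of health, so avoid calling it "healthy".
    return "no_signal", []
```

For example, `assess_drive({"seek_errors": 5})` comes back as a "watch", while any nonzero scan error count comes back as a "warning". Either way, I'd still keep backups.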

Lots more data and details in the full paper.
