Is there a well-known tool to check for this?

ars · on March 4, 2016

If you are using software RAID on Linux, then:

1. Install mdadm and configure it to run /usr/share/mdadm/checkarray every month. (This is the default on Debian.)

2. Have it run as a daemon, constantly monitoring for problems. (Also the default on Debian.)

3. Test that it actually works by setting one of your RAID devices faulty and making sure you get an email right away. A tool that detects a problem but can't tell you about it is quite useless.

4. Install smartmontools and configure /etc/smartd.conf to run nightly short self-tests and weekly long self-tests. Something like: /dev/sda -a -o on -S on -m email@domain.com -s (S/../.././02|L/../../6/03) (That schedule runs a short test every day at 02:00 and a long test every Saturday at 03:00.)

5. Test smartmontools by adding -M test to the line above, to make sure it is able to contact you.

This way you will find out about problems with the disk before they grow large.
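For steps 1 to 3, it can help to see what a degraded array looks like to a monitoring script. The following is a minimal Python sketch, not how mdadm itself is implemented. It assumes the usual /proc/mdstat layout, where a status map such as [_U] marks a missing member with an underscore. The function name and the sample text are made up for illustration.

```python
import re

# Illustrative /proc/mdstat content: md0 is healthy, md1 has lost a member.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdc1[2](F) sdd1[1]
      976630336 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status map shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S+)\s*:", line)
        if header:
            # A new array section starts here, e.g. "md0 : active raid1 ...".
            current = header.group(1)
            continue
        # The status line ends like "[2/1] [_U]"; "_" marks a missing device.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status and "_" in status.group(3):
            degraded.append(current)
    return degraded
```

On a real machine you would pass it the contents of /proc/mdstat, for example open("/proc/mdstat").read(). This only complements the mdadm daemon and its email alerts; it does not replace them.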

smartmontools has other settings for monitoring all the SMART attributes, and you can tell it when to contact you.
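To make the -s schedule in step 4 less cryptic: according to the smartd.conf man page, smartd matches the regular expression against a string of the form T/MM/DD/d/HH, where T is the test type, d is the day of the week (1 = Monday through 7 = Sunday), and HH is the hour. The Python sketch below imitates that matching to show which tests the example schedule selects at a given time. The helper name is hypothetical, and using re.fullmatch is an assumption about how smartd anchors the pattern.

```python
import re
from datetime import datetime

# The schedule from the smartd.conf line in step 4.
SCHEDULE = r"(S/../.././02|L/../../6/03)"


def scheduled_tests(when, schedule=SCHEDULE):
    """Return the test types ('S' short, 'L' long) the schedule selects at `when`."""
    # Build the MM/DD/d/HH part; isoweekday() gives 1 for Monday through 7 for Sunday.
    stamp = when.strftime("%m/%d/") + str(when.isoweekday()) + when.strftime("/%H")
    # Try each test type against the pattern, as T/MM/DD/d/HH.
    return [t for t in ("S", "L") if re.fullmatch(schedule, f"{t}/{stamp}")]
```

For example, scheduled_tests(datetime(2016, 3, 5, 3)) falls on a Saturday at 03:00 and returns ['L'], while any day at 02:00 returns ['S'].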

6. Extra credit: Install munin or another system graphing tool and graph all the SMART attributes. Check the graphs quarterly and look for anomalies. Everything should be flat except for Power_On_Hours.
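If graphing is more than you want to set up, a cruder stand-in for the same "is it flat?" check is to compare two saved copies of the smartctl -A attribute table. This Python sketch is not what munin does. It assumes the usual ten-column attribute table with RAW_VALUE last, and the function names and sample numbers are invented for illustration.

```python
# Two illustrative snapshots of `smartctl -A` output, taken some time apart.
OLD = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1234
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""
NEW = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1258
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""


def parse_attributes(smartctl_output):
    """Map attribute name to the first token of its raw value."""
    attrs = {}
    for line in smartctl_output.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID; the header row does not.
        if len(parts) >= 10 and parts[0].isdigit():
            attrs[parts[1]] = parts[9]
    return attrs


def changed_attributes(old, new, expected=("Power_On_Hours",)):
    """Return attributes whose raw value moved, ignoring ones expected to grow."""
    before, after = parse_attributes(old), parse_attributes(new)
    return sorted(
        name for name in after
        if name not in expected and before.get(name) != after[name]
    )
```

With the sample data, changed_attributes(OLD, NEW) returns ['Reallocated_Sector_Ct']. On a real drive you would probably need to add other counters that normally move to the expected tuple.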

ansible · on March 4, 2016

For btrfs, you'll want to schedule a scrub on a regular basis. This will also detect read errors and try to fix them.
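A rough sketch of a wrapper you could call from cron or a systemd timer. It assumes the btrfs command-line tool is installed and that the path you pass is a mounted btrfs filesystem. The function names are hypothetical, and treating any non-zero exit status as "needs attention" is a simplification.

```python
import subprocess


def scrub_command(mountpoint):
    """Build the command line for a foreground scrub of `mountpoint`."""
    # -B keeps the scrub in the foreground, so the call returns when it is done.
    return ["btrfs", "scrub", "start", "-B", mountpoint]


def run_scrub(mountpoint):
    """Run the scrub and return (ok, output); ok is False on a non-zero exit."""
    result = subprocess.run(
        scrub_command(mountpoint), capture_output=True, text=True
    )
    return result.returncode == 0, result.stdout + result.stderr
```

A scheduled job would call run_scrub on each filesystem and mail the output when ok is False, in the same spirit as testing that mdadm and smartd can actually reach you.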

pixl97 · on March 4, 2016

If you use 3ware/LSI/Avago controllers (now all Broadcom), you install the MegaRAID tools, which include a scheduler that runs a patrol read at whatever times you set. ReadyNAS devices have a similar setting that will check the disks.