That sounds perfectly reasonable, and 20 years ago I'd have said the same thing. It's even fairly likely that I did say it.
You know where this is going, right?
I was a SysAdmin at the time and one of the things I was responsible for was Oracle Financials. It was at the core of everything that mattered at that company -- and I was absolutely convinced that short of a water balloon fight in the machine room, my raid configuration was 100% reliable.
The first disk died late on Friday. I had two on-hand so I casually replaced it -- and before I got back to my desk another had died.
I placed an order for more drives and went back to the machine room and replaced it.
By Monday morning, things were starting to get serious -- I had to drop back to raid 5 because a few more had died over the weekend and my replacements wouldn't arrive till Tuesday.
You see, all the drives in all our raid enclosures had come from the same batch -- and they all -- every single one of them -- died within 90 days of the first one.
The chances may have been low, but reality has sharp teeth and loves the taste of overconfident sysadmin ass.
> The first disk died late on Friday. I had two on-hand so I casually replaced it -- and before I got back to my desk another had died.
That happens very very very often. It has nothing to do with a bad batch.
The second disk actually failed a while ago, but no one noticed because no one read from that part of it.
When you did the rebuild you read from the failed area and woke up the failure.
When you set up RAID you MUST read the entire array at least monthly, so that any errors are detected early! This is absolutely critical. Without that you have not set up the RAID correctly. Linux md has a built-in way to do this, and mdadm on Debian runs it by default -- but you must also have a tool that will alert you on failure, or it's worthless.
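For reference, the monthly read is just a "check" pass on the md array. A minimal sketch (the array name md0 is an example; adjust for your system):

```shell
# Kick off a full read/consistency check of the whole array:
echo check > /sys/block/md0/md/sync_action

# Watch progress while the check runs:
cat /proc/mdstat
```

This is exactly what Debian's /usr/share/mdadm/checkarray cron job does for you on the first Sunday of the month.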
You should also run a weekly full-disk read of each hard disk in the array using its onboard long self-test feature. You can use smartd to schedule that automatically -- and, more importantly, to notify you on failure.
Not using both tools is not setting up the raid correctly.
I think you have different definitions of failure. The previous person seems to claim it meant the disk was dead (i.e. no read works) while you seem to claim that it means an error caught by low level formatting.
Those scrubs do nothing to catch errors that the drives do not report, such as misdirected writes. Consequently, there is no way to set up RAID that makes data safe. A checksumming filesystem such as ZFS would handle this without a problem, though.
> The previous person seems to claim it meant the disk was dead (i.e. no read works)
He made no such claim.
> while you seem to claim that it means an error caught by low level formatting.
A: There is no such thing as low level formatting in a modern drive, and B: no, I don't. I said he should do a full disk read. Not a format.
The SMART built in self-test does a full read of the drive, not write.
> Those scrubs do nothing to catch errors that the drives do not report such as misdirected writes.
That's only true of RAID 5. Every other RAID level can compare disks and check that the data matches exactly. Linux md software RAID does that automatically if you ask it to check, and it will then report how many mismatches it found.
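The mismatch count md keeps is exposed in sysfs. A quick sketch (md0 is again a placeholder array name):

```shell
# After a "check" pass has completed, see how many stripes disagreed
# between the members of the array:
cat /sys/block/md0/md/mismatch_cnt
# 0 means all copies/parity agreed during the last check
```

Anything persistently nonzero after a check is worth investigating.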
If you look he wrote: "I had to drop back to raid 5". He had a better level of RAID before, with multiple disk redundancy, that allows the RAID to check for mismatches and even correct them.
But because he never scheduled full disk reads the RAID never detected that many of the drives had problems.
> Consequently, there is no way to set up RAID that makes data safe.
That is not correct. The only advice I would give is avoid RAID 5. The other levels let you check for correctness.
> A checksumming filesystem such as ZFS would handle this without a problem, though.
Only if A: you actually run disk checks, and B: ZFS handles the RAID itself! ZFS on top of RAID will NOT detect such errors 50% of the time (randomly, depending on which disk is read from).
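For ZFS the equivalent of the periodic full read is a scrub. A minimal sketch (the pool name "tank" is an example):

```shell
# Read every allocated block in the pool and verify its checksum;
# corruption is repaired automatically if redundancy exists:
zpool scrub tank

# Show scrub progress and any repaired or unrecoverable errors:
zpool status tank
```

Just like with md, the scrub is worthless unless something (e.g. a cron job checking `zpool status -x`) actually alerts you when errors appear.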
Doing a full read causes every sector's ECC in the low level formatting to be checked. If something is wrong, you get a read error that can be corrected by RAID, ZFS, or whatever else you are running on top, provided it has redundancy. Without the ECC, the self-test mechanism would be pointless, as it would have no way to tell whether the magnetic signals being interpreted are right or wrong.
As for other RAID levels catching things: with RAID 1 and only two mirrors, there is no way to tell which copy is right either. The same goes for RAID 10 with two-way mirrors and RAID 0+1 with two-way mirrors. You might be able to tell with RAID 6, but such things are assumed by users rather than guaranteed.

RAID was designed around the idea that uncorrectable bit errors and drive failures are the only failure modes. It is incapable of handling silent corruption in general, and in the few cases where it might be able to, whether it does is implementation dependent. RAID 6 also degrades to RAID 5 when a disk fails, and there is no way for a patrol scrub to catch a problem that occurs after it and before the next scrub. RAID will happily return incorrect data, especially since only one mirror member is read at a given time (for performance) and only the data blocks in RAID 5/6 are read (again, for performance) unless there is a disk failure.
There is no reason to use RAID under ZFS. However, ZFS will always detect silent corruption in what it reads, even when it sits on top of RAID; it just is not guaranteed to be able to correct it. Maybe you got the idea that "ZFS on top of RAID will NOT detect such errors 50% of the time" from thinking of a two-disk mirror. If you are using ZFS on RAID instead of letting ZFS have the disks and that happens, you really only have yourself to blame.
1. install mdadm and configure it to run /usr/share/mdadm/checkarray every month. (The default on Debian.)
2. have it run as a daemon constantly monitoring for problems. (Also the default on Debian.)
3. test that it actually works by setting one of your raid devices faulty and making sure you get an immediate email. A tool that detects a problem and can't tell you is quite useless.
4. install smartmontools and configure /etc/smartd.conf to run nightly short self tests and weekly long self tests. Something like: /dev/sda -a -o on -S on -m email@domain.com -s (S/../.././02|L/../../6/03)
5. do a test of smartmontools by adding -M test to the line above, to make sure it is able to contact you.
This way you will find out about problems with the disk before they grow large.
There are other settings for smartmontools to monitor all the SMART attributes and you can tell it when to contact you.
6. Extra credit: Install munin or other system graphing tools and graph all the SMART attributes. Check it quarterly and look for anomalies. Everything should be flat except for Power_On_Hours.
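The failure drills in steps 3 and 5 above can be exercised like this (device names /dev/md0 and /dev/sdb1 are examples, and the smartd service name varies by distro):

```shell
# Step 3: mark a member faulty and confirm mdadm's monitor emails you.
mdadm --manage /dev/md0 --fail /dev/sdb1
# Once the alert arrives, remove and re-add the member; the array rebuilds:
mdadm --manage /dev/md0 --remove /dev/sdb1
mdadm --manage /dev/md0 --add /dev/sdb1

# Step 5: after adding -M test to the drive's line in /etc/smartd.conf,
# restart the daemon -- it sends one test notification on startup:
systemctl restart smartd    # the service is "smartmontools" on Debian
```

Do this on a healthy, fully redundant array, not one that is already degraded.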
If you use 3ware/LSI/Avago (now all Broadcom) controllers, you can install the MegaRAID tools, which include a scheduler that runs a patrol read whenever you schedule it. ReadyNAS devices have a similar setting that will check the disks.
I wonder if, for such mission critical data, a good strategy would be to rotate out old drives at set intervals. Maybe replace one drive of your RAID5 every 6 months, no matter what its health. Once you've replaced all the drives, they will all be staggered in age by 6 months. Hopefully the chances of multiple simultaneous failures are then greatly reduced.
Replacing a drive in a RAID5 would mean voluntarily placing yourself in a high risk position every 6 months. If you're going to do that, you better make sure you use a raid level that gives you at least two drive redundancy.
That would make your failure rate higher (from increased infant mortality) and your maintenance more expensive (more spares, more transaction costs, more work, and replacing perfectly good working parts too soon).
Drives have on-condition monitoring via SMART so you can predict age related failure.
Scheduled replacement is almost the worst maintenance policy.
For mission critical data, you should have backups and use at least triple mirrors or raidz2. Quadruple mirrors or raidz3 are even better for the extra paranoid.
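Both layouts mentioned above survive any two disk failures per vdev. A sketch of each (pool and disk names are examples; use whole disks you actually have):

```shell
# raidz2: double parity across six disks
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

# triple mirror: three full copies of the data
zpool create tank2 mirror /dev/sdg /dev/sdh /dev/sdi
```

raidz2 gives more usable space; the triple mirror resilvers faster and reads faster. Neither replaces backups.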
That might have been the case with HDDs, and part of the rationale for using RAID. But I thought that with SSDs, if you have drives in RAID 1 or RAID 5, each drive receives the same writes at the same time, and if they are the same model and age, the firmware will allocate the writes to the same cells. You would end up with exactly the same wear on all cells -- and therefore all drives failing simultaneously.