Simple mdadm RAID 1 not activating spare
Adding the drive like this simply chucks it into the array without actually doing anything with it, i.e. it becomes a member of the array but is not active in it. By default, this turns it into a spare:
sudo mdadm /dev/md0 --add /dev/sdb1
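You can confirm that the newly added device really did end up as a spare rather than an active member (the device names above are just examples):

cat /proc/mdstat
sudo mdadm --detail /dev/md0

In /proc/mdstat a spare is marked with an (S) after the device name, and mdadm --detail lists it as "spare" in the device table.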
If you have a spare, you can activate it by forcing the active drive count for the array to grow. With 3 drives and 2 expected to be active, you would need to increase the active count to 3:
sudo mdadm --grow /dev/md0 --raid-devices=3
The RAID driver will notice that you are "short" a drive and look for a spare. Finding one, it will integrate it into the array as an active drive and start a re-sync. To keep tabs on the re-sync progress, open a spare terminal and let this rather crude command line run in it. Be sure to type it as one line or use the line continuation (\) character; once the rebuild finishes, just press Ctrl-C in the terminal.
while true; do sleep 60; clear; sudo mdadm --detail /dev/md0; echo; cat /proc/mdstat; done
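If you have it installed, watch does much the same job as the loop above; run it under sudo so mdadm has the privileges it needs:

sudo watch -n 60 'mdadm --detail /dev/md0; echo; cat /proc/mdstat'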
Your array will now have two active drives that are in sync, but because the array now expects 3 drives, it will not be 100% clean. Fail and remove the dead drive, then resize the array back down. Note that the --grow flag is a bit of a misnomer - it can mean either grow or shrink:
sudo mdadm /dev/md0 --fail /dev/{failed drive}
sudo mdadm /dev/md0 --remove /dev/{failed drive}
sudo mdadm --grow /dev/md0 --raid-devices=2
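A final check should then show the array back at two devices and in a clean state, i.e. Raid Devices : 2 and State : clean in the output of:

sudo mdadm --detail /dev/md0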
With regard to errors, a link problem with the drive (i.e. the PATA/SATA port, cable, or drive connector) is not enough to trigger a failover of a hot spare, as the kernel typically will switch to using the other "good" drive while it resets the link to the "bad" drive. I know this because I run a 3-drive array, 2 hot, 1 spare, and one of the drives just recently decided to barf up a bit in the logs. When I tested all the drives in the array, all 3 passed the "long" version of the SMART test, so it isn't a problem with the platters, mechanical components, or the onboard controller - which leaves a flaky link cable or a bad SATA port. Perhaps this is what you are seeing. Try switching the drive to a different motherboard port, or using a different cable, and see if it improves.
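If you suspect the cable or port, one quick piece of supporting evidence is the drive's UDMA CRC error counter, which normally only climbs when data gets corrupted on the link itself rather than on the platters (the device name is an example):

sudo smartctl -A /dev/sdX | grep -i crc

A non-zero and still-growing count there points at the cable, backplane, or port rather than at the drive.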
A follow-up: I completed my expansion of the mirror to 3 drives, failed and removed the flaky drive from the md array, hot-swapped the cable for a new one (the motherboard supports this) and re-added the drive. Upon re-add, it immediately started a re-sync of the drive. So far, not a single error has appeared in the log despite the drive being heavily used. So, yes, drive cables can go flaky.
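The exact re-add command isn't shown above, but it is typically just another --add, or --re-add if mdadm still recognises the drive's old superblock (device name is illustrative):

sudo mdadm /dev/md0 --re-add /dev/sdb1

If --re-add is refused, a plain --add will take the drive back in and trigger a full re-sync.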
I've had exactly the same problem, and in my case I found out that the active RAID disk suffered from read errors during synchronization. As a result, the new disk was never successfully synchronized and was therefore kept marked as a spare.
You might want to check your /var/log/messages and other system logs for errors.
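Where those logs live depends on the distribution (/var/log/messages, /var/log/syslog, or the systemd journal); a rough filter for md and ATA errors could look like this, adjusting the path to whatever your system uses:

sudo grep -iE 'ata[0-9]|md[0-9]|i/o error' /var/log/syslog
sudo journalctl -k | grep -iE 'ata[0-9]|md[0-9]|i/o error'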
Additionally, it might also be a good idea to check your disk's SMART status:
1) Run the short test:
smartctl -t short /dev/sda
2) Display the test results:
smartctl -l selftest /dev/sda
In my case this returned something like this:
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%       7564        27134728
# 2  Short offline       Completed: read failure       90%       7467        1408449701
I had to boot a live distro and manually copy the data from the defective disk to the new (currently "spare") one.
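If you end up having to do the same, GNU ddrescue is the usual tool for copying off a failing disk, because it keeps going past unreadable sectors and records what it could not recover in a map file (device names and the map file name are examples; double-check which disk is the source and which is the destination before running it):

sudo ddrescue -f -n /dev/sdOLD /dev/sdNEW rescue.map
sudo ddrescue -f -r3 /dev/sdOLD /dev/sdNEW rescue.map

The first pass skips the slow scraping of bad areas; the second retries just the problem spots a few times.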
I had exactly the same problem and always thought that my second disk, which I wanted to re-add to the array, had errors. But it was my original disk that had the read errors.
You could check it with smartctl -t short /dev/sdX and see the results a few minutes later with smartctl -l selftest /dev/sdX. For me it looked like this:
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       20%       25151       734566647
I tried to fix them with this manual. That was fun :-). I know you have checked both disks for errors, but I think your problem is that the disk which is still in the md array has read errors, so adding a second disk fails.
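The manual isn't linked here, but the usual approach for a pending sector is to work out what lives at that LBA and then deliberately overwrite the sector so the drive remaps it, for example with hdparm. This destroys the data in that sector, so it is very much a last resort; the LBA below is the one from the self-test output above and the device name is a placeholder:

sudo hdparm --read-sector 734566647 /dev/sdX
sudo hdparm --yes-i-know-what-i-am-doing --write-sector 734566647 /dev/sdX

The first command confirms the sector really is unreadable; the second overwrites it with zeroes.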
Update
You should additionally run smartctl -a /dev/sdX. If you see Current_Pending_Sector > 0, something is wrong:
197 Current_Pending_Sector 0x0012 098 098 000 Old_age Always - 69
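To pull just the interesting counters out of the full -a output (device name again an example):

smartctl -A /dev/sdX | grep -E 'Current_Pending_Sector|Reallocated_Sector|Offline_Uncorrectable'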
For me the problem was definitely that I had removed a disk from the RAID just for testing, and the resync could not complete because of read failures. The sync aborted halfway through. When I checked the disk which was still in the RAID array, smartctl reported problems.
I could fix them with the manual above and saw the number of pending sectors go down. But there were too many, and it is a long and boring procedure, so I used my backup and restored the data on a different server.
As you didn't have the opportunity to use SMART, I guess your self-test did not reveal those broken sectors.
For me it is a lesson learned: Check your disks before you remove one from your array.