Do I need a second RAID controller for fault-tolerance?
Solution 1:
In a 'single box high availability' design, then yes, you'd want a second controller, ideally on a second bus too. But this kind of approach has largely given way to a cheaper design based on clustering, where the failure of one box doesn't stop service. So it depends on whether you plan to use a clustered environment or rely on a single box. Even if your answer is the latter, dual controllers may be seen as adding extra complexity and may well be overkill.
edit - based on your comment about using ESXi on your other question, I'd have to say that its clustering is fabulous; we have many 32-way clusters that work brilliantly.
Solution 2:
A second RAID controller that is actively used does not give you redundancy. It only does so if it is a cold-standby controller to which you switch all your disks when the first one dies; then you have redundancy for the controller. But beware of doing so, as posted here.
So RAID gives you redundancy at the disk level, leaving a single point of failure at the controller. Having a second (unused) controller may solve this, as you could switch all the disks over to the new one. Whether this actually works depends on other factors...
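One of those factors is whether the replacement controller can import the RAID metadata the old controller wrote to the disks. As a hedged illustration only, assuming an LSI/Broadcom controller managed with the MegaCli tool (adapter numbers and exact flags vary by vendor and version), the moved disks would show up as a "foreign" configuration that has to be imported before the data is accessible:

```
# On the replacement controller, after moving the disks over:
# list any foreign (pre-existing) RAID configurations found on the disks
MegaCli -CfgForeign -Scan -aALL

# preview what would be imported, then import it on adapter 0
MegaCli -CfgForeign -Preview -a0
MegaCli -CfgForeign -Import -a0
```

If the controllers are from different vendors (or sometimes just different firmware generations), this import can fail outright, which is exactly why a cold-standby controller should be the same model as the active one.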
I'm not a native speaker, but to me "fault tolerance" is something different from "redundancy". Can some English speaker help me out here?
Solution 3:
On a single box, you actually need two RAID controllers, connected to two different PCI-E root complexes, to have complete I/O subsystem redundancy. This can be achieved with two different configurations:
- use costly dual-ported SAS disks, with each SAS link connected to a different controller. In this manner, each controller is connected to each disk. Obviously, the two controllers can't operate on the disks at the same time; some form of locking/fencing is necessary to coordinate access to the disks. SCSI has special provisions (persistent reservations) for the necessary fencing mechanism, but these must be coordinated by appropriate software. In other words, you cannot simply connect a disk to two controllers and call it a day; rather, you need the appropriate software configuration to make it work without problems;
- use normal, cheaper single-link SAS/SATA disks, connecting one half of them to each controller. For example, with 6 disks you connect 3 disks to one controller and 3 to the other. On each controller, configure a RAID array as needed (e.g. RAID 5 or RAID 1). Then, at the OS level, you can configure a software RAID 1 between the two hardware arrays, achieving full array redundancy (see the sketch after this list). While cheaper, this solution has the drawback of effectively halving your usable capacity (due to the software RAID 1 layer).
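A minimal sketch of the second option on Linux, assuming each hardware controller already exposes its half of the disks as a single block device (/dev/sda and /dev/sdb here are hypothetical names for the two controller-backed volumes, and the mdadm.conf path varies by distro):

```
# Mirror the two hardware RAID volumes with Linux md (software RAID 1).
# /dev/sda = volume exported by controller 1, /dev/sdb = by controller 2.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb

# Persist the array definition so it assembles at boot.
mdadm --detail --scan >> /etc/mdadm.conf

# Put a filesystem on the mirror and mount it.
mkfs.ext4 /dev/md0
mount /dev/md0 /srv/data

# If one controller dies, the mirror keeps running degraded;
# check its state with:
cat /proc/mdstat
```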
A key problem with both approaches is that you still do not have full system redundancy: a motherboard/CPU problem can bring down the entire system, regardless of how many controllers/disks you have.
For this reason, this kind of redundancy-in-a-box is seldom used nowadays (apart from mid/high-end SAN deployments); rather, clustering/network mirroring is gaining wide traction. With clustering (or network mirroring) you have full system redundancy, as a single failed system cannot deny data access. Obviously clustering has its own pitfalls, so it's not a silver bullet, but in some situations its advantages cannot be denied. Moreover, you can also use asynchronous network mirroring to have almost-realtime data redundancy at a geographically distant location, so that a single catastrophic event will not wreak havoc on your data.
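As one concrete example of such network mirroring, here is a hedged DRBD sketch (hostnames, IPs and device names are made up, and exact syntax varies by DRBD version); DRBD's protocol A replicates writes to the peer node asynchronously, which suits a remote location:

```
# /etc/drbd.d/r0.res, identical on both nodes
# (hypothetical hosts 'alpha' and 'beta'):
cat > /etc/drbd.d/r0.res <<'EOF'
resource r0 {
    protocol A;             # asynchronous replication
    device    /dev/drbd0;
    disk      /dev/sdb1;    # local backing device
    meta-disk internal;
    on alpha { address 10.0.0.1:7788; }
    on beta  { address 10.0.0.2:7788; }
}
EOF

# initialise the metadata and bring the resource up (run on both nodes)
drbdadm create-md r0
drbdadm up r0

# on the node that holds the current data only:
drbdadm primary --force r0
```

Protocol A acknowledges a write as soon as it is on the local disk and in the local send buffer, so the remote copy lags slightly behind; that is the "almost-realtime" trade-off mentioned above.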