To fsck or not fsck after 180 days

Solution 1:

The 180-day default fsck interval is a workaround for a design flaw: ext3 does not support an online consistency check. The real solution is to find a filesystem that does. I don't know if any mature filesystem supports this. It's a real tragedy. Perhaps btrfs will save us one day.

I've responded to the issue of surprise multi-hour downtime from fsck by doing scheduled reboots with a full fsck as part of standard maintenance. This is better than running into minor corruption during production hours and having it turn into a real outage.
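To plan that kind of maintenance window, it helps to know whether the next mount will force a check anyway. Here is a rough Python sketch that reads the text printed by `tune2fs -l` and applies the two usual triggers (mount count and check interval). The function names and the sample text are made up for illustration, and the rule is my simplified reading of how e2fsck decides, not a copy of its actual logic.

```python
from datetime import datetime, timedelta

# Illustrative excerpt in the style of `tune2fs -l` output (values are made up).
SAMPLE = """\
Mount count:              12
Maximum mount count:      30
Last checked:             Mon Jan  1 00:00:00 2018
Check interval:           15552000 (6 months)
"""

def parse_tune2fs(text):
    """Turn 'Key:   value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def check_due(fields, now):
    """Return True if a check looks due by mount count or by elapsed time."""
    mounts = int(fields["Mount count"])
    max_mounts = int(fields["Maximum mount count"])
    # The interval is printed as seconds followed by a human-readable note.
    interval = int(fields["Check interval"].split()[0])
    last = datetime.strptime(fields["Last checked"], "%a %b %d %H:%M:%S %Y")
    by_count = max_mounts > 0 and mounts >= max_mounts
    by_time = interval > 0 and now - last >= timedelta(seconds=interval)
    return by_count or by_time

if __name__ == "__main__":
    print(check_due(parse_tune2fs(SAMPLE), datetime.now()))
```

On a real box you would feed it the output of `tune2fs -l` for your device instead of `SAMPLE`. The date parsing assumes the default C/English locale, so treat it as a starting point rather than something to drop into monitoring as-is.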

A big part of the problem is that ext3 has an unreasonably slow fsck. Although xfs has a much faster fsck, it uses too much memory for distributions to encourage xfs by default on large filesystems. Still, on most systems the memory use is a non-issue. Switching to xfs would at least allow for a reasonably fast fsck, which may make running fsck as part of normal maintenance easier to schedule.

If you're running Red Hat and considering xfs, be aware of how strongly they discourage its use, and that there are probably few people using xfs on the kernel you're running.

My understanding is that the ext4 project has a goal of at least somewhat improving the fsck performance.

Solution 2:

I would say that this is just another reason why production servers should not run alone and should always have either a hot/cold backup or be part of a two-node cluster. In these days of virtualization, you can easily have a physical main server and a virtual server, which is just a copy of the physical one taken every X days, ready to take over.

Other than this not-so-helpful answer, I would say that you should weigh the importance of your data... If this is just a cluster node, skip the fsck. If this is a client's web server with no backups, you may want to plan ahead next time :-)


Solution 3:

It depends. For instance, we had one server running a qmail stack go down for routine maintenance. qmail creates and deletes a lot of files as time goes on, and it was a very busy mail server. The fsck took some 36 hours. It's not like we gained a helluva lot of performance out of the deal, but ultimately I suppose you could argue the filesystem was healthier. Was it really worth the chaos that ensued though? Not. At. All.