MongoDB and ZFS bad performance: disk always busy with reads while doing only writes
Solution 1:
First off, it's worth stating that ZFS is not a supported filesystem for MongoDB on Linux - the recommended filesystems are ext4 or XFS. Because MongoDB does not even check for ZFS on Linux (see SERVER-13223 for example), it will not use sparse files and instead attempts to pre-allocate each data file by filling it with zeroes, which means horrendous performance on a copy-on-write (COW) filesystem. Until that is fixed, adding new data files will be a massive performance hit on ZFS, and with your write load you will be adding them frequently. Between allocations performance should improve, but if you are adding data fast enough you may never recover between allocation hits.
Additionally, ZFS does not support Direct IO, so data will be copied into memory multiple times (mmap, ARC, etc.) - I suspect that this is the source of your reads, but I would have to test to be sure. The last time I saw any testing with MongoDB/ZFS on Linux the performance was poor, even with the L2ARC on an SSD - ext4 and XFS were massively faster. ZFS might be viable for MongoDB production usage on Linux in the future, but it's not ready right now.
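To make the sparse versus zero-filled distinction concrete, here is a minimal sketch that creates one file each way and compares how many blocks the filesystem actually allocated. The file names and the 16 MiB size are arbitrary choices for illustration, not anything MongoDB uses, and it assumes GNU coreutils (stat -c, truncate).

```shell
#!/bin/sh
# Compare a sparse file with a zero-filled (pre-allocated) file of the same size.
# Paths and sizes are illustrative only; they are not taken from MongoDB.

# Print the number of blocks allocated to a file (units given by %B, typically 512 bytes).
allocated_blocks() {
    stat -c %b "$1"
}

demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
size_mib=16

# Sparse: only the length is recorded, no data blocks are written.
truncate -s "${size_mib}M" "$demo_dir/sparse.dat"

# Zero-filled: every block is written out, which is the cost of pre-allocation.
dd if=/dev/zero of="$demo_dir/zeroed.dat" bs=1M count="$size_mib" conv=fsync 2>/dev/null

echo "sparse: $(allocated_blocks "$demo_dir/sparse.dat") blocks allocated"
echo "zeroed: $(allocated_blocks "$demo_dir/zeroed.dat") blocks allocated"
```

Both files report the same length, but on ext4 or XFS the zero-filled one occupies real blocks while the sparse one occupies almost none. On a COW filesystem every one of those zero blocks is a full write that will later be rewritten elsewhere, which is why the pre-allocation path hurts so much there.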
Solution 2:
This may sound a bit crazy, but I support another application that benefits from ZFS's volume management features yet does not perform well on the native ZFS filesystem.
My solution?!?
XFS on top of ZFS zvols.
Why?!?
Because XFS performs well and eliminates the application-specific issues I was facing with native ZFS. ZFS zvols allow me to thin-provision volumes, add compression, enable snapshots and make efficient use of the storage pool. More important for my app, the ARC caching of the zvol reduced the I/O load on the disks.
See if you can follow this output:
# zpool status
  pool: vol0
 state: ONLINE
  scan: scrub repaired 0 in 0h3m with 0 errors on Sun Mar 2 12:09:15 2014
config:

        NAME                                            STATE     READ WRITE CKSUM
        vol0                                            ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            scsi-SATA_OWC_Mercury_AccOW140128AS1243223  ONLINE       0     0     0
            scsi-SATA_OWC_Mercury_AccOW140128AS1243264  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            scsi-SATA_OWC_Mercury_AccOW140128AS1243226  ONLINE       0     0     0
            scsi-SATA_OWC_Mercury_AccOW140128AS1243185  ONLINE       0     0     0
ZFS zvol, created with: zfs create -o volblocksize=128K -s -V 800G vol0/pprovol
(note that auto-snapshots are enabled)
# zfs get all vol0/pprovol
NAME PROPERTY VALUE SOURCE
vol0/pprovol type volume -
vol0/pprovol creation Wed Feb 12 14:40 2014 -
vol0/pprovol used 273G -
vol0/pprovol available 155G -
vol0/pprovol referenced 146G -
vol0/pprovol compressratio 3.68x -
vol0/pprovol reservation none default
vol0/pprovol volsize 900G local
vol0/pprovol volblocksize 128K -
vol0/pprovol checksum on default
vol0/pprovol compression lz4 inherited from vol0
vol0/pprovol readonly off default
vol0/pprovol copies 1 default
vol0/pprovol refreservation none default
vol0/pprovol primarycache all default
vol0/pprovol secondarycache all default
vol0/pprovol usedbysnapshots 127G -
vol0/pprovol usedbydataset 146G -
vol0/pprovol usedbychildren 0 -
vol0/pprovol usedbyrefreservation 0 -
vol0/pprovol logbias latency default
vol0/pprovol dedup off default
vol0/pprovol mlslabel none default
vol0/pprovol sync standard default
vol0/pprovol refcompressratio 4.20x -
vol0/pprovol written 219M -
vol0/pprovol snapdev hidden default
vol0/pprovol com.sun:auto-snapshot true local
Properties of ZFS zvol block device. 900GB volume (143GB actual size on disk):
# fdisk -l /dev/zd0
Disk /dev/zd0: 966.4 GB, 966367641600 bytes
3 heads, 18 sectors/track, 34952533 cylinders
Units = cylinders of 54 * 512 = 27648 bytes
Sector size (logical/physical): 512 bytes / 131072 bytes
I/O size (minimum/optimal): 131072 bytes / 131072 bytes
Disk identifier: 0x48811e83
Device Boot Start End Blocks Id System
/dev/zd0p1 38 34952534 943717376 83 Linux
XFS information on ZFS block device:
# xfs_info /dev/zd0p1
meta-data=/dev/zd0p1 isize=256 agcount=32, agsize=7372768 blks
= sectsz=4096 attr=2, projid32bit=0
data = bsize=4096 blocks=235928576, imaxpct=25
= sunit=32 swidth=32 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal bsize=4096 blocks=65536, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
XFS mount options:
# mount
/dev/zd0p1 on /ppro type xfs (rw,noatime,logbufs=8,logbsize=256k,nobarrier)
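For anyone who wants to try the same layering, the steps can be sketched roughly as follows. This is a dry-run outline rather than my exact procedure: POOL, VOL, SIZE and MOUNTPOINT are placeholders, the run helper is a made-up convenience, and it assumes the /dev/zvol/<pool>/<volume> device links that ZFS on Linux normally creates. Unlike the output above it puts XFS straight on the zvol without a partition table, and it leaves out nobarrier on the assumption that newer kernels no longer accept that option.

```shell
#!/bin/sh
# Dry-run sketch of layering XFS on a thin-provisioned, compressed zvol.
# POOL, VOL, SIZE and MOUNTPOINT are placeholders; set DRY_RUN=0 to really execute.
POOL=${POOL:-vol0}
VOL=${VOL:-pprovol}
SIZE=${SIZE:-800G}
MOUNTPOINT=${MOUNTPOINT:-/ppro}
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

provision_xfs_on_zvol() {
    # Compression is set on the pool's root dataset and inherited by the zvol.
    run zfs set compression=lz4 "$POOL"
    # -s makes the volume sparse (thin-provisioned); -V sets its logical size.
    run zfs create -s -o volblocksize=128K -V "$SIZE" "$POOL/$VOL"
    # Assumed device path: ZFS on Linux links zvols under /dev/zvol/<pool>/<volume>.
    run mkfs.xfs "/dev/zvol/$POOL/$VOL"
    run mkdir -p "$MOUNTPOINT"
    run mount -o noatime,logbufs=8,logbsize=256k "/dev/zvol/$POOL/$VOL" "$MOUNTPOINT"
}

provision_xfs_on_zvol
```

With the defaults this only prints the five commands, so you can review them before running anything against a real pool.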
Note: I also do this on top of HP Smart Array hardware RAID in some cases.
The pool creation looks like:
zpool create -o ashift=12 -f vol1 wwn-0x600508b1001ce908732af63b45a75a6b
With the result looking like:
# zpool status -v
  pool: vol1
 state: ONLINE
  scan: scrub repaired 0 in 0h14m with 0 errors on Wed Feb 26 05:53:51 2014
config:

        NAME                                      STATE     READ WRITE CKSUM
        vol1                                      ONLINE       0     0     0
          wwn-0x600508b1001ce908732af63b45a75a6b  ONLINE       0     0     0
Solution 3:
We were looking into running Mongo on ZFS and saw that this post raised major concerns about the performance available. Two years on, we wanted to see how new releases of Mongo, which use WiredTiger instead of mmap, perform on the now officially supported ZFS that ships with the latest Ubuntu Xenial release.
In summary, ZFS clearly doesn't perform quite as well as ext4 or XFS. However, the performance gap isn't that significant, especially when you consider the extra features that ZFS offers.
I've made a blog post about our findings and methodology. I hope you find it useful!
Solution 4:
I believe your disk is busy doing reads because of the
zfs_arc_max=2147483648
setting. Here you are explicitly limiting the ARC to 2 GB, even though you have 16-32 GB of RAM. ZFS is extremely memory-hungry and zealous when it comes to the ARC. If you have non-ZFS replicas identical to ZFS replicas (HW RAID1 underneath), doing some maths yields
5 s spike @ (200 MB/s writes (estimated throughput of one HDD) * 2 (RAID1)) = 2 GB over 5 s
which means you are probably invalidating the whole ARC in about 5 seconds. The ARC is (to some degree) "intelligent" and will try to retain both the most recently written blocks and the most used ones, so your ZFS volume may well be trying to provide you with a decent data cache in the limited space it has. Try raising zfs_arc_max to half of your RAM (or even more) and using arc_shrink_shift to evict ARC data more aggressively.
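The arithmetic above, plus a candidate value for zfs_arc_max, can be reproduced with a small sketch like the one below. The function names are my own, and the 200 MB/s per-disk figure is the same rough estimate as before, not a measurement. It assumes a Linux /proc/meminfo.

```shell
#!/bin/sh
# Back-of-the-envelope ARC sizing. All inputs are estimates, not measurements.

# Whole seconds needed for a write stream to push arc_bytes of data through the ARC.
arc_turnover_seconds() {
    arc_bytes=$1
    write_mb_per_s=$2
    echo $(( arc_bytes / (write_mb_per_s * 1000000) ))
}

# Half of physical RAM in bytes. MemTotal is reported in kB, so kB * 1024 / 2 = kB * 512.
half_ram_bytes() {
    awk '/^MemTotal:/ { printf "%.0f\n", $2 * 512 }' "${1:-/proc/meminfo}"
}

# 2 GiB ARC limit against 2 x 200 MB/s of writes, as estimated above.
echo "ARC turnover: ~$(arc_turnover_seconds 2147483648 400) s"

# Candidate line for /etc/modprobe.d/zfs.conf (applied when the module loads).
echo "options zfs zfs_arc_max=$(half_ram_bytes)"
```

On ZFS on Linux the same value can usually also be written to /sys/module/zfs/parameters/zfs_arc_max at runtime, but check that the file exists on your version before relying on it.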
Here you can find a 17-part blog series on tuning and understanding ZFS filesystems.
Here you can find an explanation of the ARC shrink shift setting (first paragraph), which lets you reclaim more ARC RAM on eviction and keep it under control.
I'm unsure of the reliability of the XFS on zvol solution. ZFS is COW, but XFS is not. Suppose that XFS is updating its metadata and the machine loses power. ZFS will read the last good copy of the data thanks to the COW feature, but XFS won't know of that change. Your XFS volume may end up "snapshotted" half at the version before the power failure and half at the version after it, because ZFS does not know that the whole 8 MB write has to be atomic and contains inodes only.
[EDIT] arc_shrink_shift and other parameters are available as module parameters in ZFS on Linux. Try
modinfo zfs
to get all the supported ones for your configuration.
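If you only care about the ARC knobs, a small filter over the modinfo output helps. This is a sketch: filter_arc_params is a made-up helper name, and it assumes modinfo -p prints one "name:description" line per parameter. On ZFS on Linux the shrink-shift parameter typically shows up with a zfs_ prefix (zfs_arc_shrink_shift).

```shell
#!/bin/sh
# List ARC-related zfs module parameters and, if the module is loaded, their current values.

# Keep only "name:description" lines whose parameter name contains "arc".
filter_arc_params() {
    grep -E '^[^:]*arc[^:]*:'
}

if modinfo zfs >/dev/null 2>&1; then
    modinfo -p zfs | filter_arc_params
    # Current values are exposed under /sys once the module is loaded.
    for p in /sys/module/zfs/parameters/*arc*; do
        [ -r "$p" ] && echo "$(basename "$p") = $(cat "$p")"
    done
else
    echo "zfs module not found on this machine" >&2
fi
```

On a machine without the zfs module it just prints a notice, so it is safe to run anywhere.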