Why do my SSD read latency benchmarks get markedly worse when I put an XFS filesystem on top?
Most modern SSDs use a page-based mapping table. Initially (or after a complete TRIM/UNMAP) the mapping table is empty - i.e., reading any LBA returns 0, even if the underlying flash page/block is not completely erased and its actual content differs from plain zeros.
This means that, after a complete blkdiscard, you are not reading from the flash chips themselves; rather, the controller immediately returns 0 for all your reads. This easily explains your findings.
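A toy sketch of this behavior (all names and structures here are illustrative, not any real controller's design): an unmapped LBA is answered with zeros straight from the mapping table, while only mapped LBAs cost an actual NAND access.

```python
# Toy model of an SSD flash translation layer (FTL) mapping table.
# Purely illustrative; real FTLs are far more sophisticated.

PAGE_SIZE = 4096

class ToyFTL:
    def __init__(self):
        self.mapping = {}       # LBA -> physical page; empty after a full TRIM
        self.nand = {}          # simulated physical flash pages
        self.nand_reads = 0     # count of actual flash accesses

    def write(self, lba, data):
        ppn = len(self.nand)            # naive append-only allocation
        self.nand[ppn] = data
        self.mapping[lba] = ppn

    def read(self, lba):
        ppn = self.mapping.get(lba)
        if ppn is None:
            # Unmapped LBA: the controller answers with zeros immediately,
            # without touching the NAND at all
            return bytes(PAGE_SIZE)
        self.nand_reads += 1            # a real (slower) flash read
        return self.nand[ppn]

    def discard(self, lba):
        self.mapping.pop(lba, None)     # TRIM simply drops the mapping

ssd = ToyFTL()
ssd.write(7, b"\xab" * PAGE_SIZE)
assert ssd.read(7) == b"\xab" * PAGE_SIZE   # mapped: hits the NAND
ssd.discard(7)
assert ssd.read(7) == bytes(PAGE_SIZE)      # trimmed: zeros, no NAND access
print("NAND reads:", ssd.nand_reads)
```

This is why a benchmark run against a freshly-discarded device measures the controller's fast path, not real flash latency; writing the filesystem populates mappings and brings the NAND back into the picture.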
Some older SSDs use different, less efficient but simpler approaches that always read from the NAND chips themselves. On such drives the value of a trimmed page/block is sometimes undefined, because the controller does not simply mark it as "empty" but actually reads from the NAND each time.
Yes, SSDs are more complex beasts than "plain" HDDs: after all, they basically are small, self-contained, thinly provisioned RAID volumes with their own filesystem/volume management layer, called the FTL (flash translation layer).