Why is sequentially reading a large file row by row with mmap and madvise sequential slower than fgets?

POSIX_MADV_SEQUENTIAL is only a hint to the system and may be completely ignored by a particular POSIX implementation.

The difference between your two solutions is that mmap requires the file to be mapped into the virtual address space entirely, whereas with fgets the I/O is done entirely in kernel space and the data is just copied into a user-space buffer whose address never changes.

This also has more potential for overlapping I/O with computation, since the I/O is done by a kernel thread.

You could perhaps increase the perceived performance of the mmap implementation by having one (or more) independent threads read the first byte of each page. Those threads would then take all the page faults, and by the time your application thread arrived at a particular page, it would already be loaded.


Reading the mmap man page reveals that the page faults can be prevented by adding MAP_POPULATE to mmap's flags:

MAP_POPULATE (since Linux 2.5.46): Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.

This way, a page-faulting pre-load thread (as suggested by Jens) becomes obsolete.
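In case it helps, here is a minimal sketch of what that could look like (error handling abbreviated; MAP_POPULATE is Linux-specific):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* MAP_POPULATE prefaults the page tables and triggers
           read-ahead, so later sequential accesses should not
           block on page faults */
        char *data = mmap(NULL, st.st_size, PROT_READ,
                          MAP_PRIVATE | MAP_POPULATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        /* ... scan data[0 .. st.st_size - 1] sequentially ... */

        munmap(data, st.st_size);
        close(fd);
        return 0;
    }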

Edit: First of all, the benchmarks you run should be done with the page cache flushed, so that the results are meaningful:

    echo 3 | sudo tee /proc/sys/vm/drop_caches

Additionally: the MADV_WILLNEED advice with madvise will pre-fault the required pages (as will POSIX_FADV_WILLNEED with fadvise). Unfortunately, these calls currently block until the requested pages are faulted in, even though the documentation says otherwise. But there are kernel patches underway which queue the pre-fault requests into a kernel work queue to make these calls asynchronous, as one would expect - making a separate read-ahead user-space thread obsolete.
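For illustration, a small sketch of issuing both hints (data, len and fd are names I made up for an existing mapping and its open file descriptor):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /* data/len describe an existing mapping; fd is the open file */
    static void prefault_hints(void *data, size_t len, int fd)
    {
        /* mmap path: ask for the whole mapping to be read ahead
           (currently this may block, as described above) */
        if (madvise(data, len, MADV_WILLNEED) != 0)
            perror("madvise");

        /* read path: the equivalent hint; note posix_fadvise returns
           an error number instead of setting errno */
        int err = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
        if (err != 0)
            fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
    }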


I'm not an expert so I will just share what I do know, maybe it will help you.

What you're doing - reading through the entire mmap space - is supposed to trigger a series of page faults. With mmap, the OS only lazily loads pages of the mmap'd data into memory (it loads them when you access them). With mmap, the OS can also easily purge unchanged pages to free up memory, and will only write back pages that have been modified. So this lazy memory-mapping approach is an optimization. Although you interface with mmap as if the entire thing were in RAM, it is not all in RAM - it is just a chunk set aside in virtual memory.

A common optimization technique when using mmap is to page-walk the data, which is more or less what you're doing. This means looping through the mmap space after calling mmap, incrementing your pointer by the page size (in your case, by the size of a line) and accessing a single byte - triggering the OS to pull all of the mmap's pages into memory via those page faults. This is an optimization technique to "prime the RAM", pulling the mapping in and readying it for future use. In my experience, page-walking a full mmap'd space is around 60% slower than a flat-out read (not counting madvise(MADV_SEQUENTIAL) or other optimizations).

With a read, all the data is pipelined straight into a buffer you've already allocated, straight into RAM; it doesn't get any faster. In contrast, the mmap pages are allocated dynamically. The benefits of using mmap are the reduced memory footprint in RAM, combined with how the system can easily swap individual pages of the space in and out, purge them as needed, and so on. With read, all the data is moved straight into RAM and treated as a monolithic structure by the OS; to move the read-in structure in or out of RAM, the entire thing has to be copied to a swap file. You will immediately have a much larger memory footprint with a full-file read. Sometimes it will not all fit in RAM, in which case you have a problem. Even if it fits in RAM, it might be too large and pollute the RAM, making page faults a lot more common elsewhere (in contrast, the mmap structure is typically not all in RAM at once, even after you page-walk it initially). The OS will not be able to purge unused parts of the read-in file from RAM when it's under memory pressure; it will have to write the entire thing to a swap file if it needs more space, because it's treated as a monolithic structure. But read is faster up front.
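To make the page-walk concrete, a minimal sketch (data and len again stand for an existing mapping; these are my names, not your code):

    #include <stddef.h>
    #include <unistd.h>

    /* touch one byte per page so the whole mapping is faulted in up
       front; data/len describe an existing mmap() mapping */
    static void page_walk(const char *data, size_t len)
    {
        long page = sysconf(_SC_PAGESIZE);
        volatile char sink = 0;  /* keeps the loads from being optimized away */

        for (size_t off = 0; off < len; off += (size_t)page)
            sink += data[off];   /* this read faults the page in if needed */
        (void)sink;
    }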

One common misconception about performance is that CPU optimization is more important than memory footprint. Not true - the time it takes to travel to disk exceeds the time of CPU operations by something like 8 orders of magnitude, even with today's SSDs. Therefore, when program execution speed is a concern, memory footprint and utilization are far more important. For this reason and the ones above, mmap is generally preferred for performance. The exceptions are when the file is too small to create any significant memory pressure - in which case read will just store the data in a buffer, the initial read will be faster, and you can even keep that buffer on the stack - or when you are streaming the file in, so that only a small portion of it is in memory at once and you are primarily concerned with the initial read-in time, since the file will not be persisting in memory anyway.

One note when using mmap with madvise(MADV_SEQUENTIAL): when you call this, you must be absolutely sure your access pattern really IS sequential, otherwise it can actually slow down the paging-in of the file by about 10x.
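For reference, a minimal sketch of how that advice is applied to a mapping (again, data/len are my placeholder names):

    #include <stdio.h>
    #include <sys/mman.h>

    /* data/len describe an existing mapping from mmap() */
    static void advise_sequential(void *data, size_t len)
    {
        /* tells the kernel the mapping will be read front to back,
           so it can read ahead aggressively */
        if (madvise(data, len, MADV_SEQUENTIAL) != 0)
            perror("madvise");
    }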

An alternative way of using read, one which avoids some of these problems, is to use it with a streaming approach - which is kind of what you're doing with fgets/fputs (fgets/fputs is internally implemented on top of read, by the way). Here, in a loop, you read into a buffer, modify the data, copy it wherever you need it, and so on. Streaming like this can keep your memory consumption very low and can be the most efficient way of doing I/O. The only downside of this streaming approach is that you never have the entire file in memory at once, and the entire file doesn't persist in memory. Unless, of course, you copied the entire thing into a buffer - but if you were doing that, you might as well not have streamed the data in in the first place, so you would never do that.

Now, with your current implementation - which is a sort of streaming approach - you are using fgets() and stopping on \n. This is problematic, and is probably what's slowing down your implementation. Large, bulk reads are much more efficient than pulling the data out line by line (fgets goes through stdio's internal buffer, which is typically only a few kilobytes, and costs a library call per line). You don't have to use a giant buffer - you don't want excess memory pressure (which can pollute your cache, among other things), and the system also has some internal buffering it uses. But you do want to be reading into a buffer of, let's say, 64K in size. You definitely don't want to be pulling the data out line by line; a sketch of what I mean follows.
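Something along these lines (a sketch only; process_line is a placeholder for whatever per-line work you do, and carrying a partial line over between chunks is left out for brevity):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char buf[64 * 1024];                    /* one bulk read per 64K */
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0) {
            char *p = buf, *end = buf + n, *nl;
            /* find the newlines inside the chunk instead of issuing
               a library call per line */
            while ((nl = memchr(p, '\n', (size_t)(end - p))) != NULL) {
                /* process_line(p, (size_t)(nl - p)); */
                p = nl + 1;
            }
            /* a real version must save the tail (p .. end) and
               prepend it to the next chunk */
        }
        close(fd);
        return 0;
    }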

In short: if you only need to loop through the data and it doesn't need to be in memory all at once, then the streaming approach with read() into a 64K buffer is what you should do. If you need to work with the memory all at once and keep it in memory, use mmap() instead. And in that case, you often do want to page the memory in - but that doesn't look to be what you intend to do here.

Again, I'm not an expert, but this is my best understanding of how these things work.

Reading a little more of your comments... you should start by first testing my suggested streaming approach without the line processing, to see how much time the raw read takes. It might actually be your processing of the data in the buffer that's slowing you down. If that's the case, try adding multithreading to the processing of those lines. You might also try handling the data in a binary format - I'm not sure whether it would help, but it's worth experimenting with. I'm assuming this file is encoded in UTF-8, correct...? Try changing the file encoding; that could reduce its size. 3.5 million lines is quite a lot of characters to loop through... what is that, something like 150 million character comparisons you are doing? That seems like it could be an issue. In a case like this, even changing the format to something like 7-bit ASCII and doing binary processing could cut the run time, perhaps by 16%.

There are a variety of optimizations you can do depending on the file format itself. For example, if you can sort the file by line length before the program runs, you can write an algorithm that parses the lines much more quickly - if the problem allows for sorting the file like that. Likewise, if it's necessary to insert lines into this file, you could insert them in the appropriate place by line length.

You could even do something like create and maintain a file that is a list of offsets - the start of each line up to its terminating newline. 3.5 million offsets. Then use this in your parsing algorithm to grab each line directly, without having to search for the newline.
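A sketch of how such an index could be built in one pass (build_line_index is a name I made up; if the file ends with a newline, the last entry will point at EOF):

    #include <stdio.h>
    #include <stdlib.h>

    /* scan the file once and record the byte offset of each line start */
    static long *build_line_index(FILE *f, size_t *count)
    {
        size_t cap = 1024, n = 0;
        long *offs = malloc(cap * sizeof *offs);
        char buf[64 * 1024];
        size_t got;
        long pos = 0;

        if (offs == NULL)
            return NULL;
        offs[n++] = 0;                          /* first line starts at 0 */
        while ((got = fread(buf, 1, sizeof buf, f)) > 0) {
            for (size_t i = 0; i < got; i++) {
                if (buf[i] != '\n')
                    continue;
                if (n == cap) {                 /* grow the array */
                    cap *= 2;
                    long *tmp = realloc(offs, cap * sizeof *offs);
                    if (tmp == NULL) { free(offs); return NULL; }
                    offs = tmp;
                }
                offs[n++] = pos + (long)i + 1;  /* next line starts here */
            }
            pos += (long)got;
        }
        *count = n;
        return offs;
    }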

When you get into file-processing algorithms such as this, it begins to resemble the implementation of a NoSQL database. Another alternative might be to insert all this data into a NoSQL database! It depends on what you need to do: believe it or not, sometimes raw, custom file manipulation and maintenance is faster than any database implementation.

That's the best I've got, maybe the experts will have other ideas. Carry onward!

EDIT: I found this in my old notes, taken while reading a book on performance, which actually pertains to what we're discussing here: "One way you can speed up I/O - even faster than memory mapping - is to use the streaming options of read(). This works because copying the bits into a buffer is faster than allocating new memory with mmap (which is something one poster above noted). Note that the actual buffer size used with read() doesn't affect performance much, as long as it isn't too big - 64K seems reasonable. This is because the system reads in chunks and stores whatever you don't use in the unified buffer cache. However, you wouldn't use this method if you need all the data in memory at once, because then you'll have to allocate memory to store the contents, and that memory won't benefit from memory swapping either. The one case where this can be useful is when parsing external data into a different internal format, but it will require buffer-stitching. In cases like this you may disable caching."

He mentions disabling caching with the streaming approach. Try looking into that - I think I/O is typically cached in the unified buffer cache. If there's a way to not cache it (caching means more copying), but just stream it directly into your buffer, it could help (I'm just going by what those notes say; you'll have to look into all that).
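On Linux, the usual way to bypass the page cache is the O_DIRECT open flag - whether it helps here is something you'd have to measure, and it imposes alignment requirements on the buffer, the file offset, and the read size. A rough sketch:

    #define _GNU_SOURCE                         /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT requires the buffer (and usually the read size)
           to be aligned to the device block size */
        void *buf;
        if (posix_memalign(&buf, 4096, 64 * 1024) != 0) return 1;

        ssize_t n;
        while ((n = read(fd, buf, 64 * 1024)) > 0) {
            /* process n bytes; they were copied straight from the
               device, bypassing the page cache */
        }

        free(buf);
        close(fd);
        return 0;
    }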