Question

I’m working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable length records. I’ve got a first implementation up and running and am now looking towards improving performance, particularly at doing I/O more efficiently since the input file gets scanned many times.

Is there a rule of thumb for using mmap() versus reading in blocks via C++'s fstream library? What I’d like to do is read large blocks from disk into a buffer, process complete records from the buffer, and then read more.

The mmap() code could potentially get very messy since mmap’d blocks need to lie on page sized boundaries (my understanding) and records could potentially like across page boundaries. With fstreams, I can just seek to the start of a record and begin reading again, since we’re not limited to reading blocks that lie on page sized boundaries.

How can I decide between these two options without actually writing up a complete implementation first? Any rules of thumb (e.g., mmap() is 2x faster) or simple tests?

Answer by Dietrich Epp

I was trying to find the final word on mmap / read performance on Linux and I came across a nice post (link) on the Linux kernel mailing list. It’s from 2000, so there have been many improvements to IO and virtual memory in the kernel since then, but it nicely explains the reason why mmap or read might be faster or slower.

• A call to mmap has more overhead than read (just like epoll has more overhead than poll, which has more overhead than read). Changing virtual memory mappings is a quite expensive operation on some processors for the same reasons that switching between different processes is expensive.
• The IO system can already use the disk cache, so if you read a file, you’ll hit the cache or miss it no matter what method you use.

However,

• Memory maps are generally faster for random access, especially if your access patterns are sparse and unpredictable.
• Memory maps allow you to keep using pages from the cache until you are done. This means that if you use a file heavily for a long period of time, then close it and reopen it, the pages will still be cached. With read, your file may have been flushed from the cache ages ago. This does not apply if you use a file and immediately discard it. (If you try to mlock pages just to keep them in cache, you are trying to outsmart the disk cache and this kind of foolery rarely helps system performance).
• Reading a file directly is very simple and fast.

The discussion of mmap/read reminds me of two other performance discussions:

• Some Java programmers were shocked to discover that nonblocking I/O is often slower than blocking I/O, which made perfect sense if you know that nonblocking I/O requires making more syscalls.

• Some other network programmers were shocked to learn that epoll is often slower than poll, which makes perfect sense if you know that managing epoll requires making more syscalls.

Conclusion: Use memory maps if you access data randomly, keep it around for a long time, or if you know you can share it with other processes (MAP_SHARED isn’t very interesting if there is no actual sharing). Read files normally if you access data sequentially or discard it after reading. And if either method makes your program less complex, do that. For many real world cases there’s no sure way to show one is faster without testing your actual application and NOT a benchmark.

(Sorry for necro’ing this question, but I was looking for an answer and this question kept coming up at the top of Google results.)