This post has moved to eklausmeier.goip.de/blog/2016/02-03-performance-comparison-mmap-versus-read-versus-fread.
I recently read in Computers are *fast*! by Julia Evans about a comparison between fread() and mmap() suggesting that both calls deliver roughly the same performance. Unfortunately, the code mentioned there, referenced in bytesum.c for fread() and bytesum_mmap.c for mmap(), does not really compare the same thing. The first adds up size_t values, the second adds up uint8_t values. On my computer these two programs behave differently and therefore show different performance.
I reprogrammed the comparison, adding read() to fread() and mmap(). The code is on GitHub. Compile with
cc -Wall -O3 tbytesum1.c -o tbytesum1
For this program the results are as follows:
/home/klm/c: time ./tbytesum1 -f ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real    0m0.187s
user    0m0.077s
sys     0m0.110s

/home/klm/c: time tbytesum1 -f ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real    0m0.193s
user    0m0.100s
sys     0m0.090s

/home/klm/c: time tbytesum1 -r ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real    0m0.186s
user    0m0.080s
sys     0m0.103s

/home/klm/c: time tbytesum1 -r ~ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real    0m0.196s
user    0m0.110s
sys     0m0.083s

/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real    0m0.152s
user    0m0.110s
sys     0m0.040s

/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real    0m0.159s
user    0m0.113s
sys     0m0.043s
The file in question is the Ubuntu ISO image for the server edition, roughly 564 MB, stored on a conventional hard drive (Seagate 2TB drive). fread() and read() make no difference. The mmap() version, however, needs roughly half the system time (83ms down to 40ms), which reduces the total running time by roughly 20% (186ms down to 152ms).
A similar test with a short video from an SSD:
/home/klm/c: time tbytesum1 -r CLIP0627.AVI
The answer is: -122687217
real    0m0.097s
user    0m0.033s
sys     0m0.063s

/home/klm/c: time tbytesum1 -r CLIP0627.AVI
The answer is: -122687217
real    0m0.097s
user    0m0.050s
sys     0m0.043s

/home/klm/c: time tbytesum1 -f CLIP0627.AVI
The answer is: -122687217
real    0m0.093s
user    0m0.050s
sys     0m0.040s

/home/klm/c: time tbytesum1 -f CLIP0627.AVI
The answer is: -122687217
real    0m0.098s
user    0m0.043s
sys     0m0.053s

/home/klm/c: time tbytesum1 -m CLIP0627.AVI
The answer is: -122687217
real    0m0.079s
user    0m0.050s
sys     0m0.027s

/home/klm/c: time tbytesum1 -m CLIP0627.AVI
The answer is: -122687217
real    0m0.084s
user    0m0.050s
sys     0m0.030s
The AVI file is roughly 259 MB. Again, fread() and read() don’t differ, but mmap() is roughly 30% faster in terms of system time.
These tests were conducted on a 4.3.3-3-ARCH x86_64 system utilizing an AMD FX-8120 Eight-Core processor running up to 3.1 GHz. gcc used for compiling was 5.3.0.
Linus Torvalds gave the following remarks on read() versus mmap() in mmap/mlock performance versus read:
People love mmap() and other ways to play with the page tables to optimize away a copy operation, and sometimes it is worth it.
HOWEVER, playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvement.
Downsides to mmap():
- quite noticeable setup and teardown costs. And I mean _noticeable_. It’s things like following the page tables to unmap everything cleanly. It’s the book-keeping for maintaining a list of all the mappings. It’s the TLB flush needed after unmapping stuff.
- page faulting is expensive. That’s how the mapping gets populated, and it’s quite slow.
Upsides of mmap():
- if the data gets re-used over and over again (within a single map operation), or if you can avoid a lot of other logic by just mapping something in, mmap() is just the greatest thing since sliced bread. This may be a file that you go over many times (the binary image of an executable is the obvious case here - the code jumps all around the place), or a setup where it’s just so convenient to map the whole thing in without regard of the actual usage patterns that mmap() just wins. You may have random access patterns, and use mmap() as a way of keeping track of what data you actually needed.
- if the data is large, mmap() is a great way to let the system know what it can do with the data-set. The kernel can forget pages as memory pressure forces the system to page stuff out, and then just automatically re-fetch them again. And the automatic sharing is obviously a case of this.
It is interesting to note that the performance is kernel-dependent. The same tests conducted on Ubuntu 14.04.3 LTS, kernel version 3.13.0-74-generic #118-Ubuntu SMP, x86_64, on the exact same hardware, give a less clear-cut picture:
/home/klm/c: time tbytesum1 -r ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real    0m0.191s
user    0m0.072s
sys     0m0.119s

/home/klm/c: time tbytesum1 -r ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real    0m0.177s
user    0m0.073s
sys     0m0.104s

/home/klm/c: time tbytesum1 -f ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real    0m0.187s
user    0m0.092s
sys     0m0.095s

/home/klm/c: time tbytesum1 -f ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real    0m0.181s
user    0m0.077s
sys     0m0.104s

/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real    0m0.174s
user    0m0.104s
sys     0m0.072s

/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso
The answer is: -98049011
real    0m0.175s
user    0m0.102s
sys     0m0.071s
Again, read() and fread() make no real difference. The difference in system time between the read()/fread() variants and mmap() is roughly 25% (95ms down to 71ms, taking the lowest value of each). The compiler was gcc 4.8.4.
Testing a file on SSD:
root@chieftec:~# time ~klm/c/tbytesum1 -r CLIP0627.AVI
The answer is: -122687217
real    0m0.098s
user    0m0.039s
sys     0m0.059s

/home/klm/c: time tbytesum1 -r CLIP0627.AVI
The answer is: -122687217
real    0m0.093s
user    0m0.047s
sys     0m0.047s

/home/klm/c: time tbytesum1 -f CLIP0627.AVI
The answer is: -122687217
real    0m0.092s
user    0m0.036s
sys     0m0.056s

/home/klm/c: time tbytesum1 -f CLIP0627.AVI
The answer is: -122687217
real    0m0.087s
user    0m0.040s
sys     0m0.047s

/home/klm/c: time tbytesum1 -m CLIP0627.AVI
The answer is: -122687217
real    0m0.091s
user    0m0.047s
sys     0m0.043s

/home/klm/c: time tbytesum1 -m CLIP0627.AVI
The answer is: -122687217
real    0m0.086s
user    0m0.047s
sys     0m0.039s
With the file on SSD, the difference fades even more than for the file stored on hard disk: the running times of read() and mmap() are almost identical, contrary to the result on kernel 4.3.3.
A kernel dependency was also hinted at in CPU Usage Time Is Dependant on Load.
Both binaries, compiled with gcc 5.3.0 on Arch and gcc 4.8.4 on Ubuntu, use loop unrolling for all three functions mmaptst(), freadtst(), and readtst(), which can be seen with:
objdump -d tbytesum1
Here is the assembler code:
00000000004008e0 <mmaptst>:
  4008e0: 41 55                 push   %r13
  4008e2: 41 54                 push   %r12
  4008e4: 31 c0                 xor    %eax,%eax
  4008e6: 55                    push   %rbp
  4008e7: 53                    push   %rbx
  4008e8: 48 89 f5              mov    %rsi,%rbp
  . . .
  400a5a: 66 0f fe c1           paddd  %xmm1,%xmm0
  400a5e: 66 0f 7e c0           movd   %xmm0,%eax
  400a62: 01 c3                 add    %eax,%ebx
  400a64: 89 f8                 mov    %edi,%eax
  400a66: 48 01 c2              add    %rax,%rdx
  400a69: 41 39 fa              cmp    %edi,%r10d
  400a6c: 0f 84 a7 00 00 00     je     400b19 <mmaptst+0x239>
  400a72: 0f be 02              movsbl (%rdx),%eax
  400a75: 01 c3                 add    %eax,%ebx
  400a77: 83 fe 01              cmp    $0x1,%esi
  400a7a: 0f 84 99 00 00 00     je     400b19 <mmaptst+0x239>
  400a80: 0f be 42 01           movsbl 0x1(%rdx),%eax
  400a84: 01 c3                 add    %eax,%ebx
  400a86: 83 fe 02              cmp    $0x2,%esi
  400a89: 0f 84 8a 00 00 00     je     400b19 <mmaptst+0x239>
  400a8f: 0f be 42 02           movsbl 0x2(%rdx),%eax
  . . .
  400b06: 74 11                 je     400b19 <mmaptst+0x239>
  400b08: 0f be 42 0d           movsbl 0xd(%rdx),%eax
  400b0c: 01 c3                 add    %eax,%ebx
  400b0e: 83 fe 0e              cmp    $0xe,%esi
  400b11: 74 06                 je     400b19 <mmaptst+0x239>
  400b13: 0f be 42 0e           movsbl 0xe(%rdx),%eax
  400b17: 01 c3                 add    %eax,%ebx
  400b19: 44 89 ef              mov    %r13d,%edi
  . . .
Compiling with gcc 5.3.0 and option -march=native, i.e.,
cc -Wall -O3 -march=native tbytesum1.c -o tbytesum1N
and reading the Ubuntu ISO file from HD reduces real time by roughly 10% (152ms down to 138ms), and reduces user time by roughly 15% (110ms down to 93ms). The generated code uses the AVX instructions vpadd, vpsrldq, and vpmovsxwd, which this AMD processor supports.
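Added 05-Nov-2017: In the blog article “ripgrep is faster than {grep, ag, git grep, ucg, pt, sift}”, Andrew Gallant compares the performance of ag (Silver Searcher) and rg (ripgrep). On searching via memory maps, which he labels (1), versus reading through an intermediate buffer, which he labels (2), he says: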
Naively, it seems like (1) would be obviously faster. Surely, all of the bookkeeping and copying in (2) would make it much slower! In fact, this is not at all true. (1) may not require much bookkeeping from the perspective of the programmer, but there is a lot of bookkeeping going on inside the Linux kernel to maintain the memory map. (That link goes to a mailing list post that is quite old, but it still appears relevant today.)
When I first started writing ripgrep, I used the memory map approach. It took me a long time to be convinced enough to start down the second path with an intermediate buffer (because neither a CPU profile nor the output of strace ever showed any convincing evidence that memory maps were to blame), but as soon as I had a prototype of (2) working, it was clear that it was much faster than the memory map approach.
With all that said, memory maps aren’t all bad. They just happen to be bad for the particular use case of “rapidly open, scan and close memory maps for thousands of small files.” For a different use case, like, say, “open this large file and search it once,” memory maps turn out to be a boon. We’ll see that in action in our single-file benchmarks later.