Performance Comparison of mmap() versus read() versus fread()

I recently read in Computers are *fast*! by Julia Evans about a comparison between fread() and mmap() suggesting that both calls deliver roughly the same performance. Unfortunately, the code referenced there, bytesum.c for fread() and bytesum_mmap.c for mmap(), does not really compare the same thing: the first adds up size_t values, the second adds up uint8_t values. On my computer these two programs behave differently and therefore show different performance.
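To illustrate why the two are not comparable, here is a minimal sketch. It is not the code from bytesum.c or bytesum_mmap.c; the names sum_words() and sum_bytes() are made up, and it assumes that "adding size_t" means walking the buffer in size_t-sized words. The two routines perform a different number of additions and arrive at different totals for the same buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: add the buffer up one size_t word at a time.
 * Trailing bytes that do not fill a whole word are ignored. */
size_t sum_words(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t total = 0, word;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i + sizeof word <= len; i += sizeof word) {
        memcpy(&word, p + i, sizeof word);  /* avoids unaligned loads */
        total += word;                      /* wraps around on overflow */
    }
    return total;
}

/* Hypothetical helper: add the buffer up one byte at a time. */
size_t sum_bytes(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    size_t total = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        total += p[i];
    return total;
}
```

With 8-byte words the first loop runs one eighth as many iterations as the second, so timing one against the other says little about fread() versus mmap().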

I reprogrammed the comparison, adding read() alongside fread() and mmap(). The code is on GitHub. Compile with

cc -Wall -O3 tbytesum1.c -o tbytesum1

For this program the results are as follows (-f selects fread(), -r selects read(), and -m selects mmap(); each variant is run twice):

/home/klm/c: time ./tbytesum1 -f ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.187s
user    0m0.077s
sys     0m0.110s
/home/klm/c: time tbytesum1 -f ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.193s
user    0m0.100s
sys     0m0.090s
/home/klm/c: time tbytesum1 -r ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.186s
user    0m0.080s
sys     0m0.103s
/home/klm/c: time tbytesum1 -r ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.196s
user    0m0.110s
sys     0m0.083s
/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.152s
user    0m0.110s
sys     0m0.040s
/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.159s
user    0m0.113s
sys     0m0.043s

The file in question is the Ubuntu ISO image for the server edition, roughly 564 MB, stored on a classical hard drive (Seagate 2TB drive). fread() and read() perform the same. The mmap()’ed version needs roughly half the system time (83 ms down to 40 ms), which reduces the total running time by roughly 20% (186 ms down to 152 ms).
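For readers without the repository at hand, the three variants being compared can be sketched roughly as follows. This is not the code from tbytesum1.c: the function names, the 64 KB buffer size, and the long long accumulator are assumptions chosen for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

enum { BUFSZ = 64 * 1024 };  /* assumed buffer size, not from tbytesum1.c */

/* Add n signed bytes to sum; long long avoids overflow for large files. */
static long long add_up(const signed char *p, size_t n, long long sum)
{
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* Variant 1: read() copies from the page cache into our buffer. */
int sum_read(const char *path, long long *out)
{
    static signed char buf[BUFSZ];
    long long sum = 0;
    ssize_t n;
    assert(path && out);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        sum = add_up(buf, (size_t)n, sum);
    close(fd);
    if (n < 0)
        return -1;
    *out = sum;
    return 0;
}

/* Variant 2: fread() goes through the stdio layer. */
int sum_fread(const char *path, long long *out)
{
    static signed char buf[BUFSZ];
    long long sum = 0;
    size_t n;
    assert(path && out);
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        sum = add_up(buf, n, sum);
    int failed = ferror(fp);
    fclose(fp);
    if (failed)
        return -1;
    *out = sum;
    return 0;
}

/* Variant 3: mmap() maps the file into our address space, no copy. */
int sum_mmap(const char *path, long long *out)
{
    struct stat st;
    long long sum = 0;
    assert(path && out);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size > 0) {  /* mmap() rejects a zero length */
        size_t len = (size_t)st.st_size;
        void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        sum = add_up(p, len, 0);
        munmap(p, len);
    }
    close(fd);
    *out = sum;
    return 0;
}
```

All three functions use the same add_up() loop, so any difference in running time comes from how the data reaches user space and not from the summation itself.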

A similar test with a short video from a SSD:

/home/klm/c: time tbytesum1 -r CLIP0627.AVI
The answer is: -122687217

real    0m0.097s
user    0m0.033s
sys     0m0.063s
/home/klm/c: time tbytesum1 -r CLIP0627.AVI
The answer is: -122687217

real    0m0.097s
user    0m0.050s
sys     0m0.043s
/home/klm/c: time tbytesum1 -f CLIP0627.AVI
The answer is: -122687217

real    0m0.093s
user    0m0.050s
sys     0m0.040s
/home/klm/c: time tbytesum1 -f CLIP0627.AVI
The answer is: -122687217

real    0m0.098s
user    0m0.043s
sys     0m0.053s
/home/klm/c: time tbytesum1 -m CLIP0627.AVI
The answer is: -122687217

real    0m0.079s
user    0m0.050s
sys     0m0.027s
/home/klm/c: time tbytesum1 -m CLIP0627.AVI
The answer is: -122687217

real    0m0.084s
user    0m0.050s
sys     0m0.030s

The AVI file is roughly 259 MB. Again, fread() and read() don’t differ, but mmap() needs roughly 30% less system time.

These tests were conducted on a 4.3.3-3-ARCH x86_64 system with an AMD FX-8120 Eight-Core processor running at up to 3.1 GHz. The compiler was gcc 5.3.0.

Linus Torvalds gave the following remarks on read() versus mmap() in mmap/mlock performance versus read:

People love mmap() and other ways to play with the page tables to optimize away a copy operation, and sometimes it is worth it.

HOWEVER, playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvment.

Downsides to mmap():

  • quite noticeable setup and teardown costs. And I mean _noticeable_. It’s things like following the page tables to unmap everything cleanly. It’s the book-keeping for maintaining a list of all the mappings. It’s the TLB flush needed after unmapping stuff.
  • page faulting is expensive. That’s how the mapping gets populated, and it’s quite slow.

Upsides of mmap():

  • if the data gets re-used over and over again (within a single map operation), or if you can avoid a lot of other logic by just mapping something in, mmap() is just the greatest thing since sliced bread.

    This may be a file that you go over many times (the binary image of an executable is the obvious case here – the code jumps all around the place), or a setup where it’s just so convenient to map the whole thing in without regard of the actual usage patterns that mmap() just wins. You may have random access patterns, and use mmap() as a way of keeping track of what data you actually needed.

  • if the data is large, mmap() is a great way to let the system know what it can do with the data-set. The kernel can forget pages as memory pressure forces the system to page stuff out, and then just automatically re-fetch them again.

And the automatic sharing is obviously a case of this..
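The remark that page faults are how a mapping gets populated can be observed directly. The following sketch is not part of tbytesum1.c; the function name sum_mmap_faults() and the use of getrusage() for counting are assumptions made for illustration. It maps a file, touches every byte, and reports how many page faults the process took while doing so.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <unistd.h>

/* Total page faults (minor + major) of this process so far. */
static long faults_so_far(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt + ru.ru_majflt;
}

/* Hypothetical helper: sum a non-empty file through mmap() and store the
 * number of page faults taken while reading the mapping in *faults.
 * Returns 0 on success, -1 on error or for an empty file. */
int sum_mmap_faults(const char *path, long long *sum, long *faults)
{
    struct stat st;
    assert(path && sum && faults);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;
    signed char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                         /* the mapping survives the close */
    if (p == MAP_FAILED)
        return -1;

    long before = faults_so_far();
    long long total = 0;
    for (size_t i = 0; i < len; i++)   /* touching a not-yet-mapped page faults */
        total += p[i];
    *faults = faults_so_far() - before;
    *sum = total;
    munmap(p, len);
    return 0;
}
```

The reported count depends on the kernel, since newer kernels may map several neighbouring pages per fault. On Linux, passing MAP_POPULATE to mmap() pre-faults the mapping up front; whether that helps for a single sequential pass would have to be measured.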

It is interesting to note that the performance is kernel-dependent. The same tests conducted on Ubuntu 14.04.3 LTS, kernel version 3.13.0-74-generic #118-Ubuntu SMP, x86_64, on the exact same hardware, give a less clear-cut picture:

/home/klm/c: time tbytesum1 -r ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.191s
user    0m0.072s
sys     0m0.119s
/home/klm/c: time tbytesum1 -r ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.177s
user    0m0.073s
sys     0m0.104s
/home/klm/c: time tbytesum1 -f ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.187s
user    0m0.092s
sys     0m0.095s
/home/klm/c: time tbytesum1 -f ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.181s
user    0m0.077s
sys     0m0.104s
/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.174s
user    0m0.104s
sys     0m0.072s
/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso 
The answer is: -98049011

real    0m0.175s
user    0m0.102s
sys     0m0.071s

Again, read() and fread() make no real difference. With mmap() the system time drops by roughly 25% (95 ms for the best fread() run down to 71 ms). The compiler was gcc 4.8.4.

Testing a file on SSD:

root@chieftec:~# time ~klm/c/tbytesum1 -r CLIP0627.AVI 
The answer is: -122687217

real    0m0.098s
user    0m0.039s
sys     0m0.059s
/home/klm/c: time tbytesum1 -r CLIP0627.AVI 
The answer is: -122687217

real    0m0.093s
user    0m0.047s
sys     0m0.047s
/home/klm/c: time tbytesum1 -f CLIP0627.AVI 
The answer is: -122687217

real    0m0.092s
user    0m0.036s
sys     0m0.056s
/home/klm/c: time tbytesum1 -f CLIP0627.AVI 
The answer is: -122687217

real    0m0.087s
user    0m0.040s
sys     0m0.047s
/home/klm/c: time tbytesum1 -m CLIP0627.AVI 
The answer is: -122687217

real    0m0.091s
user    0m0.047s
sys     0m0.043s
/home/klm/c: time tbytesum1 -m CLIP0627.AVI 
The answer is: -122687217

real    0m0.086s
user    0m0.047s
sys     0m0.039s

With the file on SSD the difference fades even further than for the file stored on hard disk: running times for read() and mmap() are almost identical, contrary to the result on kernel 4.3.3.

A kernel dependency was also hinted at in CPU Usage Time Is Dependant on Load.

Both binaries, whether compiled by gcc 5.3.0 on Arch or by gcc 4.8.4 on Ubuntu, use loop unrolling for all three functions mmaptst(), freadtst(), and readtst(), which can be seen with:

objdump -d tbytesum1

Here is an excerpt of the assembler code for mmaptst():

00000000004008e0 <mmaptst>:
  4008e0:	41 55                	push   %r13
  4008e2:	41 54                	push   %r12
  4008e4:	31 c0                	xor    %eax,%eax
  4008e6:	55                   	push   %rbp
  4008e7:	53                   	push   %rbx
  4008e8:	48 89 f5             	mov    %rsi,%rbp
. . .
  400a5a:	66 0f fe c1          	paddd  %xmm1,%xmm0
  400a5e:	66 0f 7e c0          	movd   %xmm0,%eax
  400a62:	01 c3                	add    %eax,%ebx
  400a64:	89 f8                	mov    %edi,%eax
  400a66:	48 01 c2             	add    %rax,%rdx
  400a69:	41 39 fa             	cmp    %edi,%r10d
  400a6c:	0f 84 a7 00 00 00    	je     400b19 <mmaptst+0x239>
  400a72:	0f be 02             	movsbl (%rdx),%eax
  400a75:	01 c3                	add    %eax,%ebx
  400a77:	83 fe 01             	cmp    $0x1,%esi
  400a7a:	0f 84 99 00 00 00    	je     400b19 <mmaptst+0x239>
  400a80:	0f be 42 01          	movsbl 0x1(%rdx),%eax
  400a84:	01 c3                	add    %eax,%ebx
  400a86:	83 fe 02             	cmp    $0x2,%esi
  400a89:	0f 84 8a 00 00 00    	je     400b19 <mmaptst+0x239>
  400a8f:	0f be 42 02          	movsbl 0x2(%rdx),%eax
. . .
  400b06:	74 11                	je     400b19 <mmaptst+0x239>
  400b08:	0f be 42 0d          	movsbl 0xd(%rdx),%eax
  400b0c:	01 c3                	add    %eax,%ebx
  400b0e:	83 fe 0e             	cmp    $0xe,%esi
  400b11:	74 06                	je     400b19 <mmaptst+0x239>
  400b13:	0f be 42 0e          	movsbl 0xe(%rdx),%eax
  400b17:	01 c3                	add    %eax,%ebx
  400b19:	44 89 ef             	mov    %r13d,%edi
. . .
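The scalar part of the listing (a movsbl load followed by an add into %ebx, repeated for offsets 0x0 through 0xe) is consistent with a plain C loop that adds sign-extended bytes into a 32-bit accumulator. The paddd instruction suggests that the compiler additionally vectorized the main loop and kept the scalar sequence for the leftover bytes that do not fill a 16-byte vector. A loop of roughly the following shape could lead to such code at -O3. It is a guess and not copied from tbytesum1.c, and the name sum_signed_bytes() is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Guess at the shape of the summation loop. Each byte is sign-extended
 * (cf. movsbl) and added to a 32-bit total. The accumulator is unsigned
 * so that wrap-around on large files is well defined; the final
 * conversion back to int wraps modulo 2^32 on gcc. */
int sum_signed_bytes(const signed char *p, size_t n)
{
    unsigned int acc = 0;

    assert(p != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        acc += (unsigned int)(int)p[i];
    return (int)acc;
}
```

Summing signed bytes with 32-bit wrap-around would also explain why the totals printed above are negative numbers.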

Compiling with gcc 5.3.0 and option -march=native, i.e.,

cc -Wall -O3 -march=native tbytesum1.c -o tbytesum1N

and reading the Ubuntu ISO file from HD reduces real time by roughly 10% (152 ms down to 138 ms), and reduces user time by roughly 15% (110 ms down to 93 ms). The generated code uses AVX instructions from the vpadd family as well as vpsrldq and vpmovsxwd.
