Why is reading lines from stdin much slower in C++ than Python?
Just out of curiosity I've taken a look at what happens under the hood, and I've used dtruss/strace on each test.
C++
./a.out < in
Saw 6512403 lines in 8 seconds. Crunch speed: 814050
syscalls sudo dtruss -c ./a.out < in
CALL COUNT
__mac_syscall 1
<snip>
open 6
pread 8
mprotect 17
mmap 22
stat64 30
read_nocancel 25958
Python
./a.py < in
Read 6512402 lines in 1 seconds. LPS: 6512402
syscalls sudo dtruss -c ./a.py < in
CALL COUNT
__mac_syscall 1
<snip>
open 5
pread 8
mprotect 17
mmap 21
stat64 29
tl;dr: Because of different default settings in C++ requiring more system calls.
By default, cin is synchronized with stdio, which causes it to avoid any input buffering. If you add this to the top of your main, you should see much better performance:
std::ios_base::sync_with_stdio(false);
Normally, when an input stream is buffered, instead of reading one character at a time, the stream will be read in larger chunks. This reduces the number of system calls, which are typically relatively expensive. However, since the FILE*-based stdio and iostreams often have separate implementations and therefore separate buffers, this could lead to a problem if both were used together. For example:
int myvalue1;
cin >> myvalue1;
int myvalue2;
scanf("%d",&myvalue2);
If more input was read by cin than it actually needed, then the second integer value wouldn't be available for the scanf function, which has its own independent buffer. This would lead to unexpected results.
To avoid this, by default, streams are synchronized with stdio. One common way to achieve this is to have cin read each character one at a time as needed using stdio functions. Unfortunately, this introduces a lot of overhead. For small amounts of input, this isn't a big problem, but when you are reading millions of lines, the performance penalty is significant.
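To make the cost concrete, here is a small simulation of the idea, not a measurement of any real iostream implementation. CountingSource and count_newlines are hypothetical names I made up for illustration; each call to CountingSource::read stands in for one system call.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for a file descriptor: every read() call models
// one system call, and calls() reports how many were made.
class CountingSource {
public:
    explicit CountingSource(std::string data) : data_(std::move(data)) {}

    // Copy up to max_bytes into out; returns 0 at end of input.
    std::size_t read(char* out, std::size_t max_bytes) {
        ++calls_;
        std::size_t n = std::min(max_bytes, data_.size() - pos_);
        std::copy_n(data_.data() + pos_, n, out);
        pos_ += n;
        return n;
    }

    std::size_t calls() const { return calls_; }

private:
    std::string data_;
    std::size_t pos_ = 0;
    std::size_t calls_ = 0;
};

// Count '\n' bytes, fetching at most `chunk` bytes per read() call.
std::size_t count_newlines(CountingSource& src, std::size_t chunk) {
    std::string buf(chunk, '\0');
    std::size_t newlines = 0;
    for (;;) {
        std::size_t n = src.read(&buf[0], chunk);
        if (n == 0) break;  // end of input
        newlines += static_cast<std::size_t>(
            std::count(buf.begin(), buf.begin() + n, '\n'));
    }
    return newlines;
}
```

With 2000 bytes of input, a chunk size of 1 makes 2001 read() calls (one per byte plus the final empty read), while a chunk size of 4096 makes 2. Both return the same line count.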
Fortunately, the library designers decided that you should also be able to disable this feature to get improved performance if you know what you are doing, so they provided the sync_with_stdio method. From this link (emphasis added):
If the synchronization is turned off, the C++ standard streams are allowed to buffer their I/O independently, which may be considerably faster in some cases.
I'm a few years behind here, but:
In 'Edit 4/5/6' of the original post, you are using the construction:
$ /usr/bin/time cat big_file | program_to_benchmark
This is wrong in a couple of different ways:
1. You're actually timing the execution of cat, not your benchmark. The 'user' and 'sys' CPU usage displayed by time are those of cat, not your benchmarked program. Even worse, the 'real' time is also not necessarily accurate. Depending on the implementation of cat and of pipelines in your local OS, it is possible that cat writes a final giant buffer and exits long before the reader process finishes its work.
2. Use of cat is unnecessary and in fact counterproductive; you're adding moving parts. If you were on a sufficiently old system (i.e. with a single CPU and, in certain generations of computers, I/O faster than CPU), the mere fact that cat was running could substantially color the results. You are also subject to whatever input and output buffering and other processing cat may do. (This would likely earn you a 'Useless Use Of Cat' award if I were Randal Schwartz.)
A better construction would be:
$ /usr/bin/time program_to_benchmark < big_file
In this statement it is the shell which opens big_file, passing it to your program (well, actually to time, which then executes your program as a subprocess) as an already-open file descriptor. 100% of the file reading is strictly the responsibility of the program you're trying to benchmark. This gets you a real reading of its performance without spurious complications.
I will mention two possible, but actually wrong, 'fixes' which could also be considered (but I 'number' them differently as these are not things which were wrong in the original post):
A. You could 'fix' this by timing only your program:
$ cat big_file | /usr/bin/time program_to_benchmark
B. or by timing the entire pipeline:
$ /usr/bin/time sh -c 'cat big_file | program_to_benchmark'
These are wrong for the same reasons as #2: they're still using cat unnecessarily. I mention them for a few reasons:
they're more 'natural' for people who aren't entirely comfortable with the I/O redirection facilities of the POSIX shell
there may be cases where cat is needed (e.g.: the file to be read requires some sort of privilege to access, and you do not want to grant that privilege to the program to be benchmarked: sudo cat /dev/sda | /usr/bin/time my_compression_test --no-output)
in practice, on modern machines, the added cat in the pipeline is probably of no real consequence.
But I say that last thing with some hesitation. If we examine the last result in 'Edit 5' --
$ /usr/bin/time cat temp_big_file | wc -l
0.01user 1.34system 0:01.83elapsed 74%CPU ...
-- this claims that cat consumed 74% of the CPU during the test; and indeed (0.01 + 1.34) / 1.83 is approximately 74%. Perhaps a run of:
$ /usr/bin/time wc -l < temp_big_file
would have taken only the remaining 0.49 seconds! Probably not: cat here had to pay for the read() system calls (or equivalent) which transferred the file from 'disk' (actually buffer cache), as well as the pipe writes to deliver them to wc. The correct test would still have had to do those read() calls; only the write-to-pipe and read-from-pipe calls would have been saved, and those should be pretty cheap.
Still, I predict you would be able to measure the difference between cat file | wc -l and wc -l < file and find a noticeable (2-digit percentage) difference. Each of the slower tests will have paid a similar penalty in absolute time, which would however amount to a smaller fraction of its larger total time.
In fact I did some quick tests with a 1.5 gigabyte file of garbage, on a Linux 3.13 (Ubuntu 14.04) system, obtaining these results (these are actually 'best of 3' results; after priming the cache, of course):
$ time wc -l < /tmp/junk
real 0.280s user 0.156s sys 0.124s (total cpu 0.280s)
$ time cat /tmp/junk | wc -l
real 0.407s user 0.157s sys 0.618s (total cpu 0.775s)
$ time sh -c 'cat /tmp/junk | wc -l'
real 0.411s user 0.118s sys 0.660s (total cpu 0.778s)
Notice that the two pipeline results claim to have taken more CPU time (user+sys) than real wall-clock time. This is because I'm using the shell (bash)'s built-in 'time' command, which is cognizant of the pipeline; and I'm on a multi-core machine where separate processes in a pipeline can use separate cores, accumulating CPU time faster than realtime. Using /usr/bin/time I see smaller CPU time than realtime -- showing that it can only time the single pipeline element passed to it on its command line. Also, the shell's output gives milliseconds while /usr/bin/time only gives hundredths of a second.
So at the efficiency level of wc -l, the cat makes a huge difference: 407 / 280 = 1.454, or about 45% more realtime, and 775 / 280 = 2.768, or a whopping 177% more CPU used! On my random it-was-there-at-the-time test box.
I should add that there is at least one other significant difference between these styles of testing, and I can't say whether it is a benefit or fault; you have to decide this yourself:
When you run cat big_file | /usr/bin/time my_program, your program is receiving input from a pipe, at precisely the pace sent by cat, and in chunks no larger than written by cat.
When you run /usr/bin/time my_program < big_file, your program receives an open file descriptor to the actual file. Your program -- or in many cases the I/O libraries of the language in which it was written -- may take different actions when presented with a file descriptor referencing a regular file. It may use mmap(2) to map the input file into its address space, instead of using explicit read(2) system calls. These differences could have a far larger effect on your benchmark results than the small cost of running the cat binary.
Of course it is an interesting benchmark result if the same program performs significantly differently between the two cases. It shows that, indeed, the program or its I/O libraries are doing something interesting, like using mmap(). So in practice it might be good to run the benchmarks both ways; perhaps discounting the cat result by some small factor to "forgive" the cost of running cat itself.