Which is faster to delete the first line of a file... sed or tail?

Performance of sed vs. tail to remove the first line of a file

TL;DR

  • sed is very powerful and versatile, but this generality is what makes it comparatively slow, especially for large files with many lines.

  • tail does just one simple thing, but it does that thing well and fast, even for bigger files with many lines.

For small and medium-sized files, sed and tail perform similarly fast (or slow, depending on your expectations). However, for larger input files (multiple MBs), the performance difference grows significantly (an order of magnitude for files in the range of hundreds of MBs), with tail clearly outperforming sed.

Experiment

General Preparations:

Our commands to analyze are:

sed '1d' testfile > /dev/null
tail -n +2 testfile > /dev/null

Note that I'm redirecting the output to /dev/null each time to eliminate terminal output or file writes as a performance bottleneck. In case the tail invocation looks cryptic: -n +2 means "start output at line 2", i.e. print everything except the first line.
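As a quick sanity check before timing anything, one can verify with process substitution that both commands really produce identical output:

diff <(sed '1d' testfile) <(tail -n +2 testfile) && echo "identical"

If the outputs match, diff prints nothing and the echo runs.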

Let's set up a RAM disk to eliminate disk I/O as a potential bottleneck. I personally have a tmpfs mounted at /tmp, so I simply placed my testfile there for this experiment.
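If your /tmp is not a tmpfs, a dedicated RAM disk is quickly set up; the mount point and size below are just example values:

findmnt /tmp                                        # check what /tmp is mounted as
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=1G tmpfs /mnt/ramdisk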

Then I create, once, a random test file containing a given number of lines $numoflines, with random line lengths and random data, using this command (note that it's definitely not optimal and becomes really slow above roughly 2M lines, but who cares, it's not the thing we're analyzing):

cat /dev/urandom | base64 -w0 | tr 'n' '\n' | head -n "$numoflines" > testfile
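The tr 'n' '\n' part is what produces the random line lengths: base64 -w0 emits one endless line, and every literal character n in it becomes a newline. A tiny demo of the effect:

$ echo 'banana' | tr 'n' '\n'
ba
a
a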

Oh, btw.: my test laptop is running 64-bit Ubuntu 16.04 on an Intel i5-6200U CPU. Just for comparison.

Timing big files:

Setting up a huge testfile:

Running the command above with numoflines=10000000 produced a random file containing 10M lines, occupying a bit over 600 MB - it's quite huge, but let's start with it, because we can:

$ wc -l testfile 
10000000 testfile

$ du -h testfile 
611M    testfile

$ head -n 3 testfile 
qOWrzWppWJxx0e59o2uuvkrfjQbzos8Z0RWcCQPMGFPueRKqoy1mpgjHcSgtsRXLrZ8S4CU8w6O6pxkKa3JbJD7QNyiHb4o95TSKkdTBYs8uUOCRKPu6BbvG
NklpTCRzUgZK
O/lcQwmJXl1CGr5vQAbpM7TRNkx6XusYrO

Perform the timed run with our huge testfile:

Now let's first do just a single timed run with both commands, to estimate what magnitudes we're working with.

$ time sed '1d' testfile > /dev/null
real    0m2.104s
user    0m1.944s
sys     0m0.156s

$ time tail -n +2 testfile > /dev/null
real    0m0.181s
user    0m0.044s
sys     0m0.132s

We already see a really clear result for big files: tail is an order of magnitude faster than sed. But just for fun, and to be sure no random side effects are skewing the result, let's do it 100 times:

$ time for i in {1..100}; do sed '1d' testfile > /dev/null; done
real    3m36.756s
user    3m19.756s
sys     0m15.792s

$ time for i in {1..100}; do tail -n +2 testfile > /dev/null; done
real    0m14.573s
user    0m1.876s
sys     0m12.420s

The conclusion stays the same: sed is inefficient at removing the first line of a big file; tail should be used instead.
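Note that all runs above just discard the output. To actually remove the first line of a file on disk, the same tools apply; a minimal sketch using a temporary file (the .tmp name is arbitrary):

tail -n +2 testfile > testfile.tmp && mv testfile.tmp testfile

Or, if you prefer sed's in-place mode despite the speed penalty measured above:

sed -i '1d' testfile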

And yes, I know Bash's loop constructs are slow, but we're only doing relatively few iterations here and the time a plain loop takes is not significant compared to the sed/tail runtimes anyway.
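You can convince yourself of that by timing an empty loop (: is Bash's no-op builtin):

time for i in {1..100}; do :; done

That should finish in a few milliseconds at most, far below the sed/tail runtimes above.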

Timing small files:

Setting up a small testfile:

Now for completeness, let's look at the more common case that you have a small input file in the kB range. Let's create a random input file with numoflines=100, looking like this:

$ wc -l testfile 
100 testfile

$ du -h testfile 
8,0K    testfile

$ head -n 3 testfile 
tYMWxhi7GqV0DjWd
pemd0y3NgfBK4G4ho/
aItY/8crld2tZvsU5ly

Perform the timed run with our small testfile:

Since experience suggests the timings for such small files will be in the range of a few milliseconds, let's just do 1000 iterations right away:

$ time for i in {1..1000}; do sed '1d' testfile > /dev/null; done
real    0m7.811s
user    0m0.412s
sys     0m7.020s

$ time for i in {1..1000}; do tail -n +2 testfile > /dev/null; done
real    0m7.485s
user    0m0.292s
sys     0m6.020s

As you can see, the timings are quite similar; there's not much to interpret or wonder about. For small files, both tools are equally well suited.
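Divided by the iteration count, that is roughly 8 ms per sed call and 7.5 ms per tail call, presumably mostly process startup and file handling overhead. If you want the exact per-iteration figure:

echo "scale=4; 7.811/1000" | bc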


Here's another alternative, using just bash builtins and cat:

{ read -r; cat > headerless.txt; } < "$file"

$file is redirected into the { } command group. read consumes and discards the first line; cat then reads the remainder of the stream from the same input and writes it to the destination file.
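If you need this in more than one place, it can be wrapped in a small helper function; the name drop_first_line and the read -r/quoting choices here are just one way to write it:

# print the contents of the file given as $1, minus its first line
drop_first_line() {
    { read -r; cat; } < "$1"
}

drop_first_line "$file" > headerless.txt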

On my Ubuntu 16.04 machine, the performance of this and of the tail solution is very similar. I created a largish test file with seq:

$ seq 100000000 > 100M.txt
$ ls -l 100M.txt 
-rw-rw-r-- 1 ubuntu ubuntu 888888898 Dec 20 17:04 100M.txt
$

tail solution:

$ time tail -n +2 100M.txt > headerless.txt

real    0m1.469s
user    0m0.052s
sys     0m0.784s
$ 

cat/brace solution:

$ time { read ; cat > headerless.txt; } < 100M.txt 

real    0m1.877s
user    0m0.000s
sys     0m0.736s
$ 

I only have an Ubuntu VM handy right now, and I saw significant variation in the timings of both, though they were all in the same ballpark.


Trying it on my system, and prefixing each command with time, I got the following results:

sed:

real    0m0.129s
user    0m0.012s
sys     0m0.000s

and tail:

real    0m0.003s
user    0m0.000s
sys     0m0.000s

which suggests that, on my system at least (an AMD FX 8250 running Ubuntu 16.04), tail is significantly faster. The test file had 10,000 lines with a size of 540k, and was read from an HDD.
