Efficient in-place header removal for large files using sed?
Try ed instead:
ed <<< $'1d\nwq' large_file
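Here-strings are a bash/ksh/zsh feature; in a plain POSIX shell you can pipe the same two ed commands in instead (a sketch; ed's -s flag suppresses the byte counts it otherwise prints):
printf '%s\n' 1d wq | ed -s large_file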
If that “large” means about 10 million lines or more, better use tail. It is not capable of in-place editing, but its performance makes that shortcoming forgivable:
tail -n +2 large_file > large_file.new
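Then rename the copy over the original to finish the pseudo-in-place edit (note that this briefly needs room for both files on disk), as in the timed version below:
mv -f large_file.new large_file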
Edit to show some time differences:
(The awk code by Jaypal was added to get execution times on the same machine (2.2 GHz CPU).)
bash-4.2$ seq 1000000 > bigfile.txt # further file creations skipped
bash-4.2$ time sed -i 1d bigfile.txt
real    0m4.318s
bash-4.2$ time ed -s <<< $'1d\nwq' bigfile.txt
real    0m0.533s
bash-4.2$ time perl -pi -e 'undef$_ if$.==1' bigfile.txt
real    0m0.626s
bash-4.2$ time { tail -n +2 bigfile.txt > bigfile.new && mv -f bigfile.new bigfile.txt; }
real    0m0.034s
bash-4.2$ time { awk 'NR>1 {print}' bigfile.txt > newfile.txt && mv -f newfile.txt bigfile.txt; }
real    0m0.328s
There is no way to efficiently remove things from the start of a file. Removing data from the beginning requires re-writing the whole file.
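If the doubled disk usage is what hurts, the rewrite can at least be done in place: slide everything after the first line toward the front of the file, then cut off the tail. A sketch, assuming GNU dd (for iflag=skip_bytes and status=none) and coreutils truncate; it is not crash-safe, as an interruption leaves the file half-shifted:
header_len=$(head -n 1 large_file | wc -c)    # length of the first line, newline included
dd if=large_file of=large_file bs=1M \
    skip="$header_len" iflag=skip_bytes \
    conv=notrunc status=none                  # shift the rest of the file to offset 0
truncate -s "-$header_len" large_file         # drop the now-duplicated bytes at the end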
Truncating from the end of a file can be very quick though (the OS only has to adjust the file size information, possibly clearing up now-unused blocks). This is not generally possible when you try to remove from the head of a file.
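For example, with coreutils truncate, cutting a megabyte off the end of the file completes almost instantly regardless of its size:
truncate -s -1M large_file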
It could theoretically be "fast" if you removed a whole block/extent exactly, but there is no portable system call for that, so you would have to rely on filesystem-specific semantics (if such exist; see below). (Or have some form of offset inside the first block/extent to mark the real start of the file, I guess. I have never heard of that either.)
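For completeness: Linux has since gained such a call, fallocate(2) with FALLOC_FL_COLLAPSE_RANGE, exposed by util-linux's fallocate. It is supported only on a few filesystems (ext4 and XFS among them) and only in whole filesystem-block units, so it still cannot drop an arbitrary-length header line:
fallocate --collapse-range --offset 0 --length 4096 large_file    # 4096 must match the filesystem block size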
The most efficient method: don't do it! If you do, in any case, you need twice the 'large' file's space on disk, and you waste I/O.
If you are stuck with a large file that you want to read without its first line, wait until you actually need to read it to drop that line. If you need to feed the file to a program on its stdin, use tail to do it:
tail -n +2 large_file | your_program
When you do need to read the file, you can take the opportunity to remove the first line at the same time, but only if you have the needed space on disk:
tail -n +2 large_file | tee large_file2 | your_program
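Once the copy exists, the original can be swapped out:
mv -f large_file2 large_file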
If you can't read from stdin, use a fifo:
mkfifo large_file_wo_1st_line
tail -n +2 large_file > large_file_wo_1st_line &
your_program -i large_file_wo_1st_line
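The fifo is an ordinary filesystem entry, so it can be removed once your_program has finished:
rm large_file_wo_1st_line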
Or even better, if you are using bash, take advantage of process substitution:
your_program -i <(tail -n +2 large_file)
If you need to seek within the file, I do not see a better solution than not getting stuck with the file in the first place. If the file was generated on stdout:
large_file_generator | tail -n +2 > large_file
Otherwise, there is always the fifo or process substitution solution:
mkfifo large_file_with_1st_line
large_file_generator -o large_file_with_1st_line &
tail -n +2 large_file_with_1st_line > large_file_wo_1st_line
or, again with bash's process substitution:
large_file_generator -o >(tail -n +2 > large_file_wo_1st_line)