cat line X to line Y on a huge file
I suggest the sed solution, but for the sake of completeness:
awk 'NR >= 57890000 && NR <= 57890010' /path/to/file
To make awk exit after the last wanted line, instead of reading the rest of the file:
awk 'NR < 57890000 { next } { print } NR == 57890010 { exit }' /path/to/file
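If you use this often, the bounds can be passed as awk variables instead of being hard-coded (a minimal sketch; x and y are just placeholder names for your line numbers):
awk -v x=57890000 -v y=57890010 'NR < x { next } { print } NR == y { exit }' /path/to/file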
Speed test (here on macOS, YMMV on other systems):
- 100,000,000-line file generated by seq 100000000 > test.in
- Reading lines 50,000,000-50,000,010 (the awk rows reuse the question's range, lines 57,890,000-57,890,010, as written above)
- Tests in no particular order
- real time as reported by bash's builtin time
4.373 4.418 4.395 tail -n+50000000 test.in | head -n10
5.210 5.179 6.181 sed -n '50000000,50000010p;50000010q' test.in
5.525 5.475 5.488 head -n50000010 test.in | tail -n10
8.497 8.352 8.438 sed -n '50000000,50000010p' test.in
22.826 23.154 23.195 tail -n50000001 test.in | head -n10
25.694 25.908 27.638 ed -s test.in <<<"50000000,50000010p"
31.348 28.140 30.574 awk 'NR<57890000{next}1;NR==57890010{exit}' test.in
51.359 50.919 51.127 awk 'NR >= 57890000 && NR <= 57890010' test.in
These are by no means precise benchmarks, but the difference is clear and repeatable enough* to give a good sense of the relative speed of each of these commands.
*: Except between the sed -n 'p;q' and head | tail rows, which seem to be essentially the same.
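If you want to rerun a row on your own machine, a sketch using bash's time keyword (absolute numbers will of course differ):
seq 100000000 > test.in
time (tail -n +50000000 test.in | head -n 10 > /dev/null)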
If you want lines X to Y inclusive (starting the numbering at 1), use
tail -n "+$X" /path/to/file | head -n "$((Y-X+1))"
tail will read and discard the first X-1 lines (there's no way around that), then read and print the following lines. head will read and print the requested number of lines, then exit. When head exits, tail receives a SIGPIPE signal and dies, so it won't have read more than a buffer size's worth (typically a few kilobytes) of lines from the input file.
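You can watch the SIGPIPE happen in bash, which reports a pipeline member killed by a signal as 128 plus the signal number (13 for SIGPIPE) in its PIPESTATUS array. With X and Y set as above:
tail -n "+$X" /path/to/file | head -n "$((Y-X+1))" > /dev/null
echo "${PIPESTATUS[@]}"  # typically prints "141 0": tail died of SIGPIPE, head exited normally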
Alternatively, as gorkypl suggested, use sed:
sed -n -e "$X,$Y p" -e "$Y q" /path/to/file
The sed solution is significantly slower though (at least for GNU utilities and BusyBox utilities; sed might be more competitive if you extract a large part of the file on an OS where piping is slow and sed is fast). Here are quick benchmarks under Linux; the data was generated by seq 100000000 >/tmp/a, the environment is Linux/amd64, /tmp is tmpfs, and the machine is otherwise idle and not swapping.
real user sys command
0.47 0.32 0.12 </tmp/a tail -n +50000001 | head -n 10 #GNU
0.86 0.64 0.21 </tmp/a tail -n +50000001 | head -n 10 #BusyBox
3.57 3.41 0.14 sed -n -e '50000000,50000010 p' -e '50000010q' /tmp/a #GNU
11.91 11.68 0.14 sed -n -e '50000000,50000010 p' -e '50000010q' /tmp/a #BusyBox
1.04 0.60 0.46 </tmp/a tail -n +50000001 | head -n 40000001 >/dev/null #GNU
7.12 6.58 0.55 </tmp/a tail -n +50000001 | head -n 40000001 >/dev/null #BusyBox
9.95 9.54 0.28 sed -n -e '50000000,90000000 p' -e '90000000q' /tmp/a >/dev/null #GNU
23.76 23.13 0.31 sed -n -e '50000000,90000000 p' -e '90000000q' /tmp/a >/dev/null #BusyBox
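The BusyBox rows can be reproduced by invoking the applets through the busybox binary instead of the GNU tools, along these lines (assuming busybox is on your PATH):
time (</tmp/a busybox tail -n +50000001 | busybox head -n 10)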
If you know the byte range you want to work with, you can extract it faster by skipping directly to the start position. But for lines, you have to read from the beginning and count newlines. To extract blocks from x inclusive to y exclusive starting at 0, with a block size of b:
dd bs="$b" skip="$x" count="$((y-x))" </path/to/file
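For example, with 4096-byte blocks, extracting blocks 100 (inclusive) up to 125 (exclusive), i.e. bytes 409,600 through 511,999 (numbers chosen purely for illustration):
dd bs=4096 skip=100 count=25 </path/to/file >chunk.bin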
The head | tail approach is one of the best and most "idiomatic" ways to do this:
X=57890000
Y=57890010
< infile.txt head -n "$Y" | tail -n +"$X"
As pointed out by Gilles in the comments, a faster way is
< infile.txt tail -n +"$X" | head -n "$((Y - X + 1))"
This is faster because the first X - 1 lines never have to travel through the pipe, unlike in the head | tail approach.
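Wrapped up as a tiny shell function (a sketch; the name lines is made up here):
# print lines $1 through $2 (inclusive) of file $3
lines() { tail -n "+$1" "$3" | head -n "$(($2 - $1 + 1))"; }
lines 57890000 57890010 infile.txt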
Your question as phrased is a bit misleading and probably explains some of your unfounded misgivings towards this approach:
- You say you have to calculate A, B, C, D, but as you can see, the line count of the file is not needed: at most one calculation is necessary, and the shell can do that for you anyway.
- You worry that piping will read more lines than necessary. In fact this is not true: tail | head is about as efficient as you can get in terms of file I/O.

First, consider the minimum amount of work necessary: to find the Xth line of a file, the only general way is to read every byte and stop when you have counted X newlines, as there is no way to divine the file offset of the Xth line. Once you reach the Xth line, you have to read all the lines up to the Yth in order to print them. Thus no approach can get away with reading fewer than Y lines. Now, head -n "$Y" reads no more than Y lines (rounded up to the nearest buffer unit, but buffers, used correctly, improve performance, so that overhead is nothing to worry about). In addition, tail reads no more than head does, so head | tail reads the fewest lines possible (again, plus some negligible buffering that we are ignoring). The only efficiency advantage of a single-tool approach that uses no pipes is fewer processes (and thus less overhead).
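On Linux you can check the buffered-read claim directly with strace, assuming it is installed (-c prints a syscall count summary): the number of read() calls corresponds to roughly Y lines' worth of buffer-sized reads, not to the size of the whole file:
strace -e trace=read -c head -n "$Y" <infile.txt >/dev/null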