Python fastest access to line in file

I would probably just use itertools.islice. Using islice over an iterable like a file handle means the whole file is never read into memory, and the lines before the ones you want are discarded as quickly as possible. You can even collect the two lines you need into a list pretty cheaply (assuming the lines themselves aren't very long). Then you can exit the with block, closing the filehandle.

from itertools import islice
with open('afile') as f:
    # islice counts from 0, so this skips the first 4003 lines and keeps the next two
    lines = list(islice(f, 4003, 4005))
do_something_with(lines)

Update

But holy cow is linecache faster for multiple accesses. I created a million-line file to compare islice with linecache, and linecache blew it away.

>>> timeit("x=islice(open('afile'), 4003, 4005); print next(x) + next(x)", 'from itertools import islice', number=1)
4003
4004

0.00028586387634277344
>>> timeit("print getline('afile', 4003) + getline('afile', 4004)", 'from linecache import getline', number=1)
4002
4003

2.193450927734375e-05

>>> timeit("getline('afile', 4003) + getline('afile', 4004)", 'from linecache import getline', number=10**5)
0.14125394821166992
>>> timeit("''.join(islice(open('afile'), 4003, 4005))", 'from itertools import islice', number=10**5)
14.732316970825195

Constantly re-importing and re-reading the file:

This is not a practical test, but even re-importing linecache at each step, it's only a second slower than islice.

>>> timeit("from linecache import getline; getline('afile', 4003) + getline('afile', 4004)", number=10**5)
15.613967180252075

Conclusion

Yes, linecache is faster than islice in every case except when you are constantly re-creating the linecache, but who does that? For the likely scenarios (reading only a few lines once, or reading many lines once), linecache is faster and presents a terse syntax, but the islice syntax is quite clean and fast as well and never reads the whole file into memory. In a RAM-tight environment, the islice solution may be the right choice. For very high speed requirements, linecache may be the better choice. Practically, though, in most environments both times are small enough that it almost doesn't matter.


The main problem here is that line breaks are in no way different from any other character, so the OS has no way of skipping ahead to a given line.

That said, there are a few options, but for each of them you have to make sacrifices in one way or another.

You already stated the first one: use a binary file. If you have a fixed line length, you can seek ahead line * bytes_per_line bytes and jump directly to that line.
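
A minimal sketch of the fixed-width approach (the 80-byte record width and the name afile.bin are placeholders for whatever your format actually uses):

BYTES_PER_LINE = 80  # placeholder record width, including the newline

def read_fixed_line(path, lineno, width=BYTES_PER_LINE):
    with open(path, 'rb') as f:
        f.seek(lineno * width)                  # jump straight to the record
        return f.read(width).rstrip().decode()  # drop padding and the newline

print(read_fixed_line('afile.bin', 4005))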

The next option would be to use an index: create a second file, and in every entry of this index file write the byte offset of the corresponding line in your datafile. Accessing the datafile now involves two seek operations (skip to the entry in the index, then skip to index_value in the datafile), but it will still be pretty fast. Plus: it saves disk space because the lines can have different lengths. Minus: you can't touch the datafile with an editor.
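
Here is a sketch of that idea; to keep the index itself seekable it stores the offsets as fixed-size binary values via struct rather than as text lines (the file names are placeholders):

import struct

OFFSET_FMT = '<Q'                         # one unsigned 64-bit offset per line
OFFSET_SIZE = struct.calcsize(OFFSET_FMT)

def build_index(data_path, index_path):
    # One-off pass: record the byte offset at which every line starts.
    with open(data_path, 'rb') as data, open(index_path, 'wb') as idx:
        offset = 0
        for line in data:
            idx.write(struct.pack(OFFSET_FMT, offset))
            offset += len(line)

def get_line(data_path, index_path, lineno):
    with open(index_path, 'rb') as idx:
        idx.seek(lineno * OFFSET_SIZE)     # first seek: into the index
        offset, = struct.unpack(OFFSET_FMT, idx.read(OFFSET_SIZE))
    with open(data_path, 'rb') as data:
        data.seek(offset)                  # second seek: into the datafile
        return data.readline().decode()

build_index('afile', 'afile.idx')
print(get_line('afile', 'afile.idx', 4005))   # line numbers are 0-based here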

One more option (I think I would go with this one) is to use only one file but begin every line with its line number and some kind of separator (e.g. 4005: My data line). Now you can use a modified version of binary search (https://en.wikipedia.org/wiki/Binary_search_algorithm) to seek to your line. This takes around log(n) seek operations, with n being the total number of lines. Plus: you can edit the file, and it saves space compared to fixed-length lines. And it's still very fast: even for one million lines this is only about 20 seek operations, which happen in no time. Minus: the most complex of these possibilities. (But fun to do ;)
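
A sketch of that binary search, assuming ascending line numbers and ':' as the separator (numbered.txt is a placeholder name). It bisects on byte offsets and resynchronizes to the next line boundary after every seek:

def first_line_at_or_after(f, pos):
    # Return (offset, raw_line) of the first complete line starting at or after pos.
    if pos == 0:
        f.seek(0)
    else:
        f.seek(pos - 1)
        f.readline()              # finish the line containing byte pos - 1
    return f.tell(), f.readline()

def find_numbered_line(path, target):
    with open(path, 'rb') as f:
        f.seek(0, 2)              # jump to the end to learn the file size
        lo, hi = 0, f.tell()
        while lo < hi:            # roughly log2(file size) iterations
            mid = (lo + hi) // 2
            start, line = first_line_at_or_after(f, mid)
            if not line or int(line.split(b':', 1)[0]) >= target:
                hi = mid                   # keep searching the earlier half
            else:
                lo = start + len(line)     # keep searching after this line
        _, line = first_line_at_or_after(f, lo)
        if line and int(line.split(b':', 1)[0]) == target:
            return line.decode()
        return None

print(find_numbered_line('numbered.txt', 4005))   # e.g. "4005: My data line"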

EDIT: One more solution: split your file into many smaller ones. If you have very long 'lines', this could be as small as one line per file, and then I would group them into folders, e.g. 4/0/05. But even with shorter lines you can divide your file into, say, roughly 1 MB chunks, name them 1000.txt, 2000.txt, and so on, and reading the one (or two) chunks matching your line completely should be pretty fast and very easy to implement.
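
A sketch of the chunked layout, under the assumption that the file was pre-split into chunks of 1000 lines named 1000.txt, 2000.txt, ..., with 1000.txt holding the first thousand lines:

from itertools import islice

LINES_PER_CHUNK = 1000

def read_chunked_line(lineno):
    # Pick the chunk that contains the (0-based) line, then slice within it.
    chunk_name = '%d.txt' % ((lineno // LINES_PER_CHUNK + 1) * LINES_PER_CHUNK)
    offset = lineno % LINES_PER_CHUNK
    with open(chunk_name) as f:
        return next(islice(f, offset, offset + 1))

print(read_chunked_line(4005))   # opens 5000.txt and returns its sixth line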