Cheap way to search a large text file for a string

If it is a "pretty large" file, access the lines sequentially rather than reading the whole file into memory:

with open('largeFile', 'r') as inF:
    for line in inF:
        if 'myString' in line:
            # do_something
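
For instance, a minimal sketch of one way to fill in the do_something stub, collecting the 1-based line numbers of every match (the file name and search string are the same placeholders as above):

matches = []
with open('largeFile', 'r') as inF:
    for line_no, line in enumerate(inF, start=1):  # number lines from 1
        if 'myString' in line:
            matches.append(line_no)
print(matches)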

The following function works for text and binary files (though it only reports the position as a byte offset). It has the advantage of finding strings even when they straddle a line or buffer boundary, which would be missed by a line-wise or buffer-wise search. Since the file is opened in binary mode, the search string s must be a bytes object (e.g. b'myString') in Python 3.

import os

def fnd(fname, s, start=0):
    """Return the byte offset of the first occurrence of s (a bytes object)
    in the file, searching from start; return -1 if not found."""
    with open(fname, 'rb') as f:
        fsize = os.path.getsize(fname)
        bsize = 4096  # buffer size; must be larger than len(s)
        if start > 0:
            f.seek(start)
        overlap = len(s) - 1
        while True:
            # Step back so a match straddling two buffers is still found.
            if overlap <= f.tell() < fsize:
                f.seek(f.tell() - overlap)
            buffer = f.read(bsize)
            if buffer:
                pos = buffer.find(s)
                if pos >= 0:
                    # Convert the buffer-relative offset to a file offset.
                    return f.tell() - (len(buffer) - pos)
            else:
                return -1

The idea behind this is:

  • seek to a start position in the file
  • read from the file into a buffer (the search string has to be smaller than the buffer size); if not at the beginning of the file, step back len(s) - 1 bytes so a string that starts at the end of one buffer and continues into the next is still caught
  • return the position, or -1 if not found

I used something like this to find file signatures inside larger ISO9660 files; it was quite fast and did not use much memory. You can also use a larger buffer to speed things up.
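
As a usage sketch: because fnd takes a start offset, you can call it repeatedly to collect every occurrence. The file name below is a made-up example (b'CD001' is the ISO9660 volume-descriptor identifier), and fnd_all is a hypothetical helper, not part of the answer above:

def fnd_all(fname, s):
    """Collect all byte offsets of s in the file (overlapping hits included)."""
    positions = []
    pos = fnd(fname, s, 0)
    while pos != -1:
        positions.append(pos)
        # Restart just past the current hit; fnd itself steps back
        # len(s) - 1 bytes, so the next search effectively begins at pos + 1.
        pos = fnd(fname, s, pos + len(s))
    return positions

print(fnd_all('image.iso', b'CD001'))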


You could do a simple find:

with open('file.txt', 'r') as f:
    lines = f.read()
answer = lines.find('string')

A simple find will be quite a bit quicker than regex if you can get away with it. Note that this reads the whole file into memory, so it only suits files that fit comfortably in RAM.
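
If you want to check that claim on your own data, here is a rough sketch comparing a plain substring search against a pre-compiled regex (the file name and search string are the same placeholders as above):

import re
import timeit

with open('file.txt', 'r') as f:
    text = f.read()

pattern = re.compile('string')
print(timeit.timeit(lambda: text.find('string'), number=100))   # plain find
print(timeit.timeit(lambda: pattern.search(text), number=100))  # regex search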

Tags:

Python