How to read records terminated by a custom separator from a file in Python?
There is nothing in the Python 2.x file object, or the Python 3.3 io classes, that lets you specify a custom delimiter for readline. (The for line in file loop is ultimately using the same code as readline.)
But it's pretty easy to build it yourself. For example:
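def delimited(file, delimiter='\n', bufsize=4096):
    buf = ''  # holds any partial record left over from the previous chunk
    while True:
        newbuf = file.read(bufsize)
        if not newbuf:
            # End of file: whatever is left is the final record.
            yield buf
            return
        buf += newbuf
        lines = buf.split(delimiter)
        # Everything but the last piece is a complete record.
        for line in lines[:-1]:
            yield line
        # The last piece may be incomplete, so keep it for the next read.
        buf = lines[-1]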
Here's a stupid example of it in action:
>>> import io
>>> s = io.StringIO('abcZZZdefZZZghiZZZjklZZZmnoZZZpqr')
>>> d = delimited(s, 'ZZZ', bufsize=2)
>>> list(d)
['abc', 'def', 'ghi', 'jkl', 'mno', 'pqr']
If you want to get it right for both binary and text files, especially in 3.x, it's a bit trickier. But if it only has to work for one or the other (and for one Python version or the other), you can ignore that.
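For what it's worth, here is a minimal sketch of one way to cover both cases in Python 3. It assumes the delimiter has the same type (str or bytes) as whatever file.read() returns, and it takes the empty starting buffer from the delimiter itself. The name delimited_any is made up for this example.

```python
import io

def delimited_any(file, delimiter, bufsize=4096):
    """Yield records from a text or binary file, split on delimiter.

    Assumption: delimiter is str for text files and bytes for binary files.
    """
    buf = delimiter[:0]  # '' for a str delimiter, b'' for a bytes delimiter
    while True:
        chunk = file.read(bufsize)
        if not chunk:
            yield buf  # final (possibly empty) record
            return
        buf += chunk
        # All pieces but the last are complete; the last may be partial.
        *records, buf = buf.split(delimiter)
        yield from records

text_records = list(delimited_any(io.StringIO('abcZZZdef'), 'ZZZ', bufsize=2))
byte_records = list(delimited_any(io.BytesIO(b'a|b|c'), b'|', bufsize=2))
```

Passing a str delimiter to a binary file (or the reverse) raises a TypeError on the first buf += chunk, which is probably what you want anyway.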
Likewise, if you're using Python 3.x (or using io objects in Python 2.x), and want to make use of the buffers that are already being maintained in a BufferedIOBase instead of just putting a buffer on top of the buffer, that's trickier. The io docs do explain how to do everything… but I don't know of any simple examples, so you're really going to have to read at least half of that page and skim the rest. (Of course, you could just use the raw files directly… but not if you want to find Unicode delimiters…)
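To make the layering a bit more concrete, here is a rough sketch for Python 3. It does not reuse the internal buffer (that's the tricky part). It just opens a file at the buffered binary layer and wraps it in a text layer, so that a non-ASCII delimiter is matched on decoded characters rather than on raw bytes. The temp file, the '§' delimiter, and the chunk_records helper are all assumptions made up for the illustration.

```python
import io
import os
import tempfile

def chunk_records(stream, delimiter, bufsize=4096):
    # Same buffer-on-top-of-a-buffer idea as above, kept self-contained here.
    buf = delimiter[:0]
    while True:
        chunk = stream.read(bufsize)
        if not chunk:
            yield buf
            return
        buf += chunk
        parts = buf.split(delimiter)
        for part in parts[:-1]:
            yield part
        buf = parts[-1]

# Write a small UTF-8 file whose records are separated by a non-ASCII character.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write('one§two§three'.encode('utf-8'))

with open(path, 'rb') as binary:  # an io.BufferedReader over a raw io.FileIO
    # The text layer decodes bytes, so read(n) counts characters, not bytes.
    text = io.TextIOWrapper(binary, encoding='utf-8', newline='')
    records = list(chunk_records(text, '§', bufsize=3))

os.remove(path)
```

Reading from binary.raw instead would hand you undecoded bytes, where a multi-byte delimiter like '§' can be cut in half by a read boundary, which is why the text layer is the easier place to search.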