What is the idiomatic way to iterate over a binary file in Python?
Try:
>>> i = 0
>>> with open('dups.txt', 'rb') as f:
...     for chunk in iter((lambda: f.read(how_many_bytes_you_want_each_time)), b''):
...         i += 1
iter needs a function with zero arguments:
- a plain f.read would read the whole file, since the size parameter is missing;
- f.read(1024) means call a function and pass its return value (data loaded from file) to iter, so iter does not get a function at all;
- (lambda: f.read(1234)) is a function that takes zero arguments (nothing between lambda and :) and calls f.read(1234).
There is an equivalence between the following:
    somefunction = (lambda: f.read(how_many_bytes_you_want_each_time))
and
    def somefunction(): return f.read(how_many_bytes_you_want_each_time)
With either of these defined before your code, you could just write: iter(somefunction, b'').
Technically you can skip the parentheses around the lambda; Python's grammar will accept that.
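To convince yourself that the lambda and the def spelling really behave the same, here is a small sketch. It uses io.BytesIO as a stand-in for a real file, and the sample data and chunk size of 4 are made up for illustration. Note that in Python 3 a file opened with 'rb' returns bytes, so the sentinel has to be b''; with '' the comparison never succeeds and the loop would not stop.

```python
import io

data = b"abcdefghij"   # made-up sample data
chunk_size = 4         # hypothetical chunk size

def chunks_with_lambda(f):
    # iter() calls the zero-argument lambda until it returns b''
    return list(iter(lambda: f.read(chunk_size), b''))

def chunks_with_def(f):
    # Same thing, with a named zero-argument function
    def somefunction():
        return f.read(chunk_size)
    return list(iter(somefunction, b''))

lambda_result = chunks_with_lambda(io.BytesIO(data))
def_result = chunks_with_def(io.BytesIO(data))
print(lambda_result)  # [b'abcd', b'efgh', b'ij']
```

The last chunk is shorter than chunk_size because read() just returns whatever is left before end of file.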
I don't know of any built-in way to do this, but a wrapper function is easy enough to write:
    def read_in_chunks(infile, chunk_size=1024*64):
        while True:
            chunk = infile.read(chunk_size)
            if chunk:
                yield chunk
            else:
                # The chunk was empty, which means we're at the end
                # of the file
                return
Then at the interactive prompt:
>>> from chunks import read_in_chunks
>>> infile = open('quicklisp.lisp')
>>> for chunk in read_in_chunks(infile):
...     print(chunk)
...
<contents of quicklisp.lisp in chunks>
Of course, you can easily adapt this to use a with block:
    with open('quicklisp.lisp') as infile:
        for chunk in read_in_chunks(infile):
            print(chunk)
And you can eliminate the if statement like this:
    def read_in_chunks(infile, chunk_size=1024*64):
        chunk = infile.read(chunk_size)
        while chunk:
            yield chunk
            chunk = infile.read(chunk_size)
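A quick way to check the generator without touching the disk is to feed it an in-memory file. This is only a sketch: io.BytesIO stands in for a file opened with 'rb', and the ten bytes of sample data and the chunk size of 4 are assumptions chosen to keep the output short.

```python
import io

def read_in_chunks(infile, chunk_size=1024*64):
    # Yield successive chunks until read() returns an empty value
    chunk = infile.read(chunk_size)
    while chunk:
        yield chunk
        chunk = infile.read(chunk_size)

data = b"0123456789"  # made-up sample data
chunks = list(read_in_chunks(io.BytesIO(data), chunk_size=4))
print(chunks)  # [b'0123', b'4567', b'89']

# Joining the chunks gives back the original bytes
reassembled = b"".join(chunks)
```

Because the loop tests the truthiness of the chunk rather than comparing it to a specific sentinel, the same generator works for files opened in text mode as well.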
The Pythonic way to read a binary file iteratively is to use the built-in function iter with two arguments and the standard function functools.partial, as described in the Python library documentation:
iter(object[, sentinel])
Return an iterator object. The first argument is interpreted very differently depending on the presence of the second argument. Without a second argument, object must be a collection object which supports the iteration protocol (the __iter__() method), or it must support the sequence protocol (the __getitem__() method with integer arguments starting at 0). If it does not support either of those protocols, TypeError is raised. If the second argument, sentinel, is given, then object must be a callable object. The iterator created in this case will call object with no arguments for each call to its __next__() method; if the value returned is equal to sentinel, StopIteration will be raised, otherwise the value will be returned.
See also Iterator Types.
One useful application of the second form of iter() is to build a block-reader. For example, reading fixed-width blocks from a binary database file until the end of file is reached:
    from functools import partial
    with open('mydata.db', 'rb') as f:
        for block in iter(partial(f.read, 64), b''):
            process_block(block)
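If you want to try the documentation's pattern without a real mydata.db, something like the following should do. It is a sketch under assumptions: io.BytesIO replaces the file, the 150 bytes of input are invented, and process_block is a hypothetical handler that only records block lengths.

```python
import io
from functools import partial

# Stand-in for a binary file: 150 bytes, so the last block is short
f = io.BytesIO(bytes(range(150)))

block_sizes = []

def process_block(block):
    # Hypothetical handler: just remember how long each block was
    block_sizes.append(len(block))

# partial(f.read, 64) is a zero-argument callable equivalent to f.read(64)
for block in iter(partial(f.read, 64), b''):
    process_block(block)

print(block_sizes)  # [64, 64, 22]
```

partial(f.read, 64) plays the same role as lambda: f.read(64) here; which one you use is mostly a matter of taste.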