How to read part of binary file with numpy?
This is what I do when I have to read arbitrary data from a heterogeneous binary file.
NumPy lets you interpret a bit pattern in an arbitrary way by changing the dtype of the array.
The Matlab code in the question reads a char and two uint.
Read this paper (easy reading at the user level, not for scientists) on what one can achieve by changing the dtype, stride, and dimensionality of an array.
import numpy as np
data = np.arange(10, dtype=np.int32)  # 10 values * 4 bytes = 40 bytes
data.tofile('f')
x = np.fromfile('f', dtype='u1')  # read the file back as raw bytes
print(x.size)
# 40
second = x[8]
print('second', second)
# second 2
total_cycles = x[8:12]
print('total_cycles', total_cycles)
# total_cycles [2 0 0 0] (note the endianness: little-endian here)
total_cycles = total_cycles.view('u4')  # same 4 bytes seen as one uint32
print('total_cycles', total_cycles)
# total_cycles [2]
start_cycle = x[12:16].view('u4')
print('start_cycle', start_cycle)
# start_cycle [3]
x = x.view('u4')  # reinterpret the whole buffer, no copy is made
print('x', x)
# x [0 1 2 3 4 5 6 7 8 9]
x[3] = 423  # the views share memory, so start_cycle changes too
print('start_cycle', start_cycle)
# start_cycle [423]
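A structured dtype is another way to express the same idea: it describes the char and the two uint values as a single record. The sketch below is only an illustration under stated assumptions, not the layout of the file in the question: the field names, the little-endian byte order, and the 8-byte starting offset are all hypothetical.

```python
import numpy as np

# Hypothetical record layout: one unsigned char followed by two
# little-endian uint32 values, packed without padding (9 bytes).
record = np.dtype([('second', 'u1'),
                   ('total_cycles', '<u4'),
                   ('start_cycle', '<u4')])

# Build a small buffer: 8 bytes of unrelated data, then one record.
raw = bytes(8) + bytes([2]) + np.array([7, 3], dtype='<u4').tobytes()

# Decode exactly one record starting at byte offset 8.
rec = np.frombuffer(raw, dtype=record, count=1, offset=8)[0]
print(rec['second'], rec['total_cycles'], rec['start_cycle'])
```

Each field is then available by name, which tends to be easier to follow than slicing byte ranges by hand.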
You can use seek with a file object in the normal way, and then use this file object in fromfile. Here's a full example:
import numpy as np
import os
data = np.arange(100, dtype=np.int32)  # 4-byte integers
data.tofile("temp")  # save the data
with open("temp", "rb") as f:  # reopen the file
    f.seek(256, os.SEEK_SET)  # seek past 256 bytes = 64 integers
    x = np.fromfile(f, dtype=np.int32)  # read the rest into numpy
print(x)
# [64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
#  89 90 91 92 93 94 95 96 97 98 99]
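Newer NumPy releases (1.17 and later, as far as I know) also accept offset and count arguments in fromfile, so the seek can be folded into the call. A minimal sketch, assuming 4-byte integers and a scratch file path chosen only for illustration:

```python
import os
import tempfile
import numpy as np

# Write 100 four-byte integers to a scratch file (hypothetical path).
path = os.path.join(tempfile.mkdtemp(), "temp.bin")
np.arange(100, dtype=np.int32).tofile(path)

# Skip 256 bytes (64 items of 4 bytes each), then read 10 items.
x = np.fromfile(path, dtype=np.int32, count=10, offset=256)
print(x)
```

Note that offset is given in bytes while count is given in items of the requested dtype.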
There probably is a better answer… But when I've been faced with this problem, I had a file that I already wanted to access different parts of separately, which gave me an easy solution to this problem.
For example, say chunkyfoo.bin is a file consisting of a 6-byte header, a 1024-byte numpy array, and another 1024-byte numpy array. You can't just open the file and seek 6 bytes (because the first thing numpy.fromfile does is lseek back to 0). But you can just mmap the file and use fromstring (or np.frombuffer in current NumPy, where binary-mode fromstring is deprecated) instead:
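import mmap
from contextlib import closing
import numpy as np
with open('chunkyfoo.bin', 'rb') as f:
    with closing(mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)) as m:
        # np.frombuffer replaces the deprecated binary-mode np.fromstring
        a1 = np.frombuffer(m[6:1030])  # default dtype is float64
        a2 = np.frombuffer(m[1030:])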
This sounds like exactly what you want to do. Except, of course, that in real life the offset and length of a1 and a2 probably depend on the header, rather than being fixed constants.
The header is just m[:6], and you can parse that by explicitly pulling it apart, using the struct module, or whatever else you'd do once you read the data. But, if you'd prefer, you can explicitly seek and read from f before constructing m, or after, or even make the same calls on m, and it will work, without affecting a1 and a2.
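To make the header-driven case concrete, here is a rough sketch that parses the header with struct and derives the array offsets from it. The header layout (three little-endian uint16 values: a version and two item counts), the float64 element type, and the scratch file are all assumptions made up for this example, not something the real chunkyfoo.bin is known to contain.

```python
import mmap
import os
import struct
import tempfile
import numpy as np

# Hypothetical header layout: three little-endian uint16 values
# (a version number and the item counts of the two arrays) = 6 bytes.
HEADER = struct.Struct('<HHH')

# Build a sample file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'chunkyfoo.bin')
first = np.arange(128, dtype='<f8')          # 128 * 8 = 1024 bytes
second = np.arange(128, dtype='<f8') * 2.0   # another 1024 bytes
with open(path, 'wb') as f:
    f.write(HEADER.pack(1, first.size, second.size))
    f.write(first.tobytes())
    f.write(second.tobytes())

with open(path, 'rb') as f:
    with mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ) as m:
        version, n1, n2 = HEADER.unpack(m[:HEADER.size])
        start1 = HEADER.size
        start2 = start1 + n1 * 8             # 8 bytes per float64
        # Slicing the mmap copies the bytes, so a1 and a2 stay valid
        # after the mapping is closed.
        a1 = np.frombuffer(m[start1:start2], dtype='<f8')
        a2 = np.frombuffer(m[start2:start2 + n2 * 8], dtype='<f8')

print(version, a1[:3], a2[:3])
```

Because the slices are copies, this trades some memory for simplicity; it does not keep the arrays backed by the mapping.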
An alternative, which I've done for a different non-numpy-related project, is to create a wrapper file object, like this:
class SeekedFileWrapper(object):
    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.offset = fileobj.tell()
    def seek(self, offset, whence=0):
        if whence == 0:
            offset += self.offset
        return self.fileobj.seek(offset, whence)
    # ... delegate everything else unchanged
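One simple way to fill in the elided delegation is a catch-all __getattr__, sketched below against an in-memory file. This is not necessarily how the wrapper was originally completed: the tell method and the adjusted return value of seek are my own additions, and whether numpy accepts such a wrapper is not something this sketch checks.

```python
import io

class SeekedFileWrapper(object):
    """Treat the position fileobj had at construction time as offset 0."""

    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.offset = fileobj.tell()

    def seek(self, offset, whence=0):
        if whence == 0:
            offset += self.offset
        # Assumption: report positions relative to the wrapper's origin.
        return self.fileobj.seek(offset, whence) - self.offset

    def tell(self):
        return self.fileobj.tell() - self.offset

    def __getattr__(self, name):
        # Only called when normal lookup fails, so seek/tell above win
        # and everything else is forwarded to the wrapped file.
        return getattr(self.fileobj, name)

raw = io.BytesIO(b'HEADER' + b'payload')
raw.seek(6)                      # skip the 6-byte header
wrapped = SeekedFileWrapper(raw)
wrapped.seek(0)                  # lands on byte 6 of the real file
data = wrapped.read()            # read() is delegated via __getattr__
print(data, wrapped.tell())
```

Code that seeks to absolute positions through the wrapper then behaves as if the file started right after the header.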
I did the "delegate everything else unchanged" by generating a list of attributes at construction time and using that in __getattr__, but you probably want something less hacky. numpy only relies on a handful of methods of the file-like object, and I think they're properly documented, so just explicitly delegate those. But I think the mmap solution makes more sense here, unless you're trying to mechanically port over a bunch of explicit seek-based code. (You'd think mmap would also give you the option of leaving it as a numpy.memmap instead of a numpy.array, which lets numpy have more control over/feedback from the paging, etc. But it's actually pretty tricky to get a numpy.memmap and an mmap to work together.)
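If the goal is a numpy.memmap rather than a raw mmap, one option is to let numpy create the mapping itself, since numpy.memmap takes an offset argument in bytes. This is a different route from combining it with an existing mmap object; the sample file, dtype, and sizes below are assumptions for illustration.

```python
import os
import tempfile
import numpy as np

# Sample file: 6-byte header followed by two 1024-byte float64 arrays.
path = os.path.join(tempfile.mkdtemp(), 'chunkyfoo.bin')
with open(path, 'wb') as f:
    f.write(b'HEADER')
    f.write(np.arange(128, dtype='<f8').tobytes())
    f.write(np.ones(128, dtype='<f8').tobytes())

# Map each array separately; offset is given in bytes.
a1 = np.memmap(path, dtype='<f8', mode='r', offset=6, shape=(128,))
a2 = np.memmap(path, dtype='<f8', mode='r', offset=6 + 1024, shape=(128,))
print(a1[:3], a2[:3])
```

Here the arrays stay backed by the file, so pages are loaded only when the data is actually touched.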