How to read part of binary file with numpy?

This is what I do when I have to read arbitrary in an heterogeneous binary file.
Numpy allows to interpret a bit pattern in arbitray way by changing the dtype of the array. The Matlab code in the question reads a char and two uint.

Read this paper (easy reading on user level, not for scientists) on what one can achieve with changing the dtype, stride, dimensionality of an array.

import numpy as np

data = np.arange(10, dtype=np.int)
data.tofile('f')

x = np.fromfile('f', dtype='u1')
print x.size
# 40

second = x[8]
print 'second', second
# second 2

total_cycles = x[8:12]
print 'total_cycles', total_cycles
total_cycles.dtype = np.dtype('u4')
print 'total_cycles', total_cycles
# total_cycles [2 0 0 0]       !endianness
# total_cycles [2]

start_cycle = x[12:16]
start_cycle.dtype = np.dtype('u4')
print 'start_cycle', start_cycle
# start_cycle [3]

x.dtype = np.dtype('u4')
print 'x', x
# x [0 1 2 3 4 5 6 7 8 9]

x[3] = 423 
print 'start_cycle', start_cycle
# start_cycle [423]

You can use seek with a file object in the normal way, and then use this file object in fromfile. Here's a full example:

import numpy as np
import os

data = np.arange(100, dtype=np.int)
data.tofile("temp")  # save the data

f = open("temp", "rb")  # reopen the file
f.seek(256, os.SEEK_SET)  # seek

x = np.fromfile(f, dtype=np.int)  # read the data into numpy
print x 
# [64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
# 89 90 91 92 93 94 95 96 97 98 99]

There probably is a better answer… But when I've been faced with this problem, I had a file that I already wanted to access different parts of separately, which gave me an easy solution to this problem.

For example, say chunkyfoo.bin is a file consisting of a 6-byte header, a 1024-byte numpy array, and another 1024-byte numpy array. You can't just open the file and seek 6 bytes (because the first thing numpy.fromfile does is lseek back to 0). But you can just mmap the file and use fromstring instead:

with open('chunkyfoo.bin', 'rb') as f:
    with closing(mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)) as m:
        a1 = np.fromstring(m[6:1030])
        a2 = np.fromstring(m[1030:])

This sounds like exactly what you want to do. Except, of course, that in real life the offset and length to a1 and a2 probably depend on the header, rather than being fixed comments.

The header is just m[:6], and you can parse that by explicitly pulling it apart, using the struct module, or whatever else you'd do once you read the data. But, if you'd prefer, you can explicitly seek and read from f before constructing m, or after, or even make the same calls on m, and it will work, without affecting a1 and a2.

An alternative, which I've done for a different non-numpy-related project, is to create a wrapper file object, like this:

class SeekedFileWrapper(object):
    def __init__(self, fileobj):
        self.fileobj = fileobj
        self.offset = fileobj.tell()
    def seek(self, offset, whence=0):
        if whence == 0:
            offset += self.offset
        return self.fileobj.seek(offset, whence)
    # ... delegate everything else unchanged

I did the "delegate everything else unchanged" by generating a list of attributes at construction time and using that in __getattr__, but you probably want something less hacky. numpy only relies on a handful of methods of the file-like object, and I think they're properly documented, so just explicitly delegate those. But I think the mmap solution makes more sense here, unless you're trying to mechanically port over a bunch of explicit seek-based code. (You'd think mmap would also give you the option of leaving it as a numpy.memmap instead of a numpy.array, which lets numpy have more control over/feedback from the paging, etc. But it's actually pretty tricky to get a numpy.memmap and an mmap to work together.)

How to read part of binary file with numpy?

Tags:

Python

Numpy

Scipy

Related

Recent Posts