Using numpy.bincount with array weights

As per the numpy documentation:

numpy.bincount(x, weights=None, minlength=None)

weights : array_like, optional; Weights, array of the same shape as x.

So you can't use bincount directly in this fashion unless you alter x somehow.

Edit: So I came up with a slightly tricky way of doing this, but no guarantees about the performance when you go to large arrays. Basically I'm going to leverage how scipy sparse matrices handle repeated entries at the same indices (they sum them):

 from scipy.sparse import *
 a = np.array([[1,1], [2,2], [4,4]])
 ii = np.array([1, 1, 0])

 ares = a.reshape((-1,),order='F')
 # ares == array([1, 2, 4, 1, 2, 4])

 col = np.tile(ii,(a.shape[1],))
 # col == np.array([1, 1, 0, 1, 1, 0])

 row = np.tile([0,1],(a.shape[0],1)).reshape((-1,),order='F') 
 # row == np.array([0,0,0,1,1,1]) 

 g = coo_matrix((ares,(col,row)),shape=(2,2))
 print g.todense()     

Now you're going to have to generalize this to your precise data. The basic idea is that you want to map each data point to the correct element of your results array and then let the sparse array handle summing the duplicate entries.

Otherwise, I would look into using Cython if you are forced to use looping to solve this.

Edit 2: For kicks, I timed two different methods:

import numpy as np
from scipy.sparse import *

def method1():
    return np.array([np.bincount(ii, r) for r in a.T]).T

def method2():
    ares = a.reshape((-1,),order='F')
    col = np.tile(ii,(a.shape[1],))
    row = np.tile(np.arange(a.shape[1]),(a.shape[0],1)).reshape((-1,),order='F') 

    return coo_matrix((ares,(col,row)),shape=(np.unique(ii).size,a.shape[1])).todense()

if __name__ == '__main__':
    from timeit import Timer

    a = np.random.randint(0,1000,(1000000,3))
    ii = np.random.randint(0,10,(a.shape[0],))

    N = 100
    t1 = Timer("method1()", "from __main__ import method1")
    t2 = Timer("method2()", "from __main__ import method2")
    print 't2/t1: %f' % (t2.timeit(N)/t1.timeit(N))

On my machine, method2 is about 3-5x slower than method1 depending on the shape of the inputs so looping isn't necessarily a bad option.

Tags:

Python

Numpy