Is there a numpy/scipy dot product, calculating only the diagonal entries of the result?

I think i got it on my own, but nevertheless will share the solution:

since getting only the diagonals of a matrix multiplication

> Z = N.diag(X.dot(Y))

is equivalent to the individual sum of the scalar product of rows of X and columns of Y, the previous statement is equivalent to:

> Z = (X * Y.T).sum(-1)

For the original variables this means:

> result = (A.dot(B) * A).sum(-1)

Please correct me if I am wrong but this should be it ...


You can get almost anything you ever dreamed of with numpy.einsum. Until you start getting the hang of it, it basically seems like black voodoo...

>>> a = np.arange(15).reshape(5, 3)
>>> b = np.arange(9).reshape(3, 3)

>>> np.diag(np.dot(np.dot(a, b), a.T))
array([  60,  672, 1932, 3840, 6396])
>>> np.einsum('ij,ji->i', np.dot(a, b), a.T)
array([  60,  672, 1932, 3840, 6396])
>>> np.einsum('ij,ij->i', np.dot(a, b), a)
array([  60,  672, 1932, 3840, 6396])

EDIT You can actually get the whole thing in a single shot, it's ridiculous...

>>> np.einsum('ij,jk,ki->i', a, b, a.T)
array([  60,  672, 1932, 3840, 6396])
>>> np.einsum('ij,jk,ik->i', a, b, a)
array([  60,  672, 1932, 3840, 6396])

EDIT You don't want to let it figure too much on its own though... Added the OP's answer to its own question for comparison also.

n, p = 10000, 200
a = np.random.rand(n, p)
b = np.random.rand(p, p)

In [2]: %timeit np.einsum('ij,jk,ki->i', a, b, a.T)
1 loops, best of 3: 1.3 s per loop

In [3]: %timeit np.einsum('ij,ij->i', np.dot(a, b), a)
10 loops, best of 3: 105 ms per loop

In [4]: %timeit np.diag(np.dot(np.dot(a, b), a.T))
1 loops, best of 3: 5.73 s per loop

In [5]: %timeit (a.dot(b) * a).sum(-1)
10 loops, best of 3: 115 ms per loop

A pedestrian answer, which avoids the construction of large intermediate arrays is:

result=np.empty([n,], dtype=A.dtype )
for i in xrange(n):
    result[i]=A[i,:].dot(B).dot(A[i,:])