Why is calling Dot with many arguments so inefficient?

Here's an edited version of my answer to a related question (elsewhere).

Since your central question was about speed (or time complexity), you might wish to know an important result from the elementary theory of algorithms: the cost of multiplying a chain of matrices depends on how the chain is parenthesized (the matrix-chain multiplication problem; see Introduction to Algorithms by Cormen, Leiserson, and Rivest). So your timing will depend both on the order of multiplication (specified by grouping) and on the dimensions of your vectors and matrices. Here is an example:

A = Table[RandomReal[], {3000}];
X = Table[RandomReal[], {3000}, {8000}];
Y = Table[RandomReal[], {8000}, {8000}];

Timing[A.X.Y.X\[Transpose].A]
(* {0.956274, 1.79343*10^13} *)

Timing[(A.X).Y.(X\[Transpose].A)]
(* {0.544384, 1.79343*10^13} *)

Timing[A.(X.Y).(X\[Transpose].A)]
(* {36.569151, 1.79343*10^13} *)

So whereas all of these give the same answer, the slowest grouping is about 67 times slower than the fastest, which in turn is roughly twice as fast (for these dimensions) as the default left-to-right order. If you know your matrix sizes ahead of time, you can choose the multiplication order optimally, as described in the theory of algorithms. For dimensions around $1000$, all of the orderings should be fast enough for your purposes.
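For intuition, one can simply count the scalar multiplications each grouping performs (a rough model that ignores memory traffic and how the underlying numerical routines are dispatched; the first two groupings do essentially the same amount of arithmetic, so their factor-of-two difference presumably comes from such effects, while the third does over a thousand times as much work):

(* (A.X).Y.(X\[Transpose].A): two vector-matrix products, one more
   vector-matrix product, and a final dot product *)
3000*8000 + 8000*3000 + 8000*8000 + 8000
(* 112008000 *)

(* A.(X.Y).(X\[Transpose].A): the 3000 x 8000 by 8000 x 8000 matrix product dominates *)
3000*8000*8000 + 3000*8000 + 8000*3000 + 8000
(* 192048008000 *)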

Note too that if you have very large matrices, you may want to Parallelize[] your computation, and here too the grouping can affect the timing (and the space complexity, or memory usage) significantly.

Why does the order of multiplication matter?

Because several comments expressed an interest in understanding why proper grouping of matrices during multiplication matters, let me give a simple example to illustrate. Let $A_{p\times q}$, $B_{q\times r}$ and $C_{r \times s}$ be three matrices with the dimensions given, and suppose we seek to compute $ABC$. Concretely, consider $A_{10 \times 100}$, $B_{100\times 5}$ and $C_{5\times 50}$.

If we multiply according to $(AB)C$, we perform $10 \cdot 100 \cdot 5 = 5000$ scalar multiplications for $AB$, then an additional $10 \cdot 5 \cdot 50 = 2500$ to multiply that result by $C$: total $= 7500$ multiplications.

If we multiply instead according to $A(BC)$, we perform $100 \cdot 5 \cdot 50 = 25000$ scalar multiplications for $BC$, then an additional $10 \cdot 100 \cdot 50 = 50000$ to multiply that result on the left by $A$: total $= 75000$ multiplications, a factor of $10$ more than the other grouping.
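The same bookkeeping in code:

(* scalar multiplications for (AB)C versus A(BC), with p = 10, q = 100, r = 5, s = 50 *)
{10*100*5 + 10*5*50, 100*5*50 + 10*100*50}
(* {7500, 75000} *)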

I don't know whether Mathematica's matrix multiplication routines are "smart" about grouping (to reduce computational cost) or instead (as I suspect) merely operate left-to-right.

There are efficient dynamic programming methods to determine the optimal (i.e., least computationally costly) grouping, but that would take us far afield here...
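For the curious, the standard recurrence fits in a few lines. Here chainCost is my own name for a sketch (not a built-in), and it returns only the optimal cost, not the grouping that achieves it:

(* least number of scalar multiplications needed to multiply a chain of
   matrices with dimensions dims[[1]] x dims[[2]], dims[[2]] x dims[[3]], ... *)
chainCost[dims_List] := Module[{c},
  c[i_, i_] = 0;
  c[i_, j_] := c[i, j] =
    Min@Table[c[i, k] + c[k + 1, j] + dims[[i]] dims[[k + 1]] dims[[j + 1]],
      {k, i, j - 1}];
  c[1, Length[dims] - 1]]

chainCost[{10, 100, 5, 50}]
(* 7500, i.e. the (AB)C grouping from the example above *)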


It's an issue of growth of term size in this example. If you do the straight iteration then at each step the matrix elements roughly quadruple in size (because each time you multiply every element by a variable, and sum four such products per matrix entry). We confirm this below on an example of half the size.

tab = Partition[#, 4] & /@ 
   Table[Symbol[CharacterRange["a", "z"][[nm]] <> ToString[nn]], {nn, 
     1, 12}, {nm, 1, 16}];
t8 = tab[[1 ;; 8]];

prod1 = Apply[Dot, t8];
prod1[[1, 1]] // LeafCount
4^8

(* Out[206]= 65533

Out[207]= 65536 *)

Grouping in pairs, then pairs of such, etc. has a different effect. At each round, every new element is the sum of four products of two comparably sized elements, so we expect roughly a factor-of-eight growth per round. But for $n$ matrices there are only $\log_2 n$ such rounds (3, in the case above). To make the estimate a bit more accurate, we start with the correct leaf count after the first round of matrix products.

LeafCount[(t8[[1]].t8[[2]])[[1, 1]]]

(* Out[227]= 13 *)

So our pairs have elements with a leaf count of 13. We do two more rounds, first taking pairs to sets of four, thence to eight. So we expect the resulting element size to be around $13\times 8^2$.
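(FDot below is the pairwise-grouping function from the question. In case it is not at hand, a stand-in that groups the factors in balanced pairs could look like the following; the original may well differ in detail.)

(* stand-in for the question's FDot (an assumption about its behavior):
   Dot adjacent pairs, then pairs of the results, and so on *)
FDot[matrices_List] :=
 First@FixedPoint[Dot @@@ Partition[#, 2, 2, {1, 1}, {}] &, matrices]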

prod2 = FDot[t8];
LeafCount[prod2[[1, 1]]]
13*8^2

(* Out[233]= 877

Out[234]= 832 *)

We only took this to 8. Going to 16 would result in another factor of 8 for prod2, as opposed to a factor of $4^8$ for prod1.


For a wide variety of applications, the cost of doing a scalar product is rarely linear in the complexity of the multiplicands. Furthermore, the complexity of a product is usually larger than the complexity of the inputs. This can range from the simple case of multiplying two $n$-bit integers to get a $2n$-bit product, to the horrible case of multiplying two sparse polynomials with $n$ terms each and getting a polynomial with $n^2$ terms.
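A quick illustration of the polynomial case (the exponents here are arbitrary, chosen so that no terms in the product collide):

(* two sparse polynomials with 4 terms each; their product has 4^2 = 16 terms *)
p = x + x^10 + x^100 + x^1000;
q = 1 + x^20 + x^200 + x^2000;
Length[Expand[p q]]
(* 16 *)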

The product of two matrices, of course, involves doing lots of scalar products, and correspondingly the product matrix has entries of greater complexity than the inputs. In your example, the scalar sums involved in taking a matrix product are increasing the complexity of your matrix entries too.

Since the costs aren't linear, the cost of chained multiplications can vary wildly depending on the order. A simple method that often gets nearly optimal performance is to try and balance the products: e.g. by always multiplying the two least complex terms of the list.

Grouping terms as you've done achieves this. Doing the products in order, however, is pretty much the worst case scenario.
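As an illustration of that heuristic (my own sketch, not anything Dot itself does), a greedy version for a chain of symbolic matrices could use LeafCount as the complexity measure, combining only adjacent factors since matrix multiplication is not commutative:

(* greedy "balance the products" sketch: repeatedly Dot the adjacent pair
   whose factors currently have the smallest combined LeafCount *)
greedyDot[matrices_List] := First@NestWhile[
   Function[lst, Module[{i},
     i = First@Ordering[Total /@ Partition[LeafCount /@ lst, 2, 1], 1];
     ReplacePart[Drop[lst, {i + 1}], i -> lst[[i]].lst[[i + 1]]]]],
   matrices, Length[#] > 1 &]

(* usage: greedyDot[{m1, m2, m3, ...}] for a list of compatibly sized matrices *)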


Addendum: the sparse polynomial example maybe wasn't such a good choice, because in the typical case, the cost of computing the product is linear in the complexity of the output. But I think the main idea is still relevant in this case because you're doing the additions too.

Also, in the case of sparse polynomials, usually the game is to try and maintain the sparseness as long as possible, before products have so many terms they have to be considered dense. Generally there isn't much you can do to maintain sparsity, but what little you can do in a generic way amounts again to balanced products.