Where is strassen's matrix multiplication useful?

Generally Strassen’s Method is not preferred for practical applications for following reasons.

  1. The constants used in Strassen’s method are high and for a typical application Naive method works better.
  2. For Sparse matrices, there are better methods especially designed for them.
  3. The submatrices in recursion take extra space.
  4. Because of the limited precision of computer arithmetic on noninteger values, larger errors accumulate in Strassen’s algorithm than in Naive Method.

So the idea of strassen's algorithm is that it's faster (asymptotically speaking). This could make a big difference if you are dealing with either huge matrices or else a very large number of matrix multiplications. However, just because it's faster asymptotically doesn't make it the most efficient algorithm practically. There are all sorts of implementation considerations such as caching and architecture specific quirks. Also there is also parallelism to consider.

I think your best bet would be to look at the common libraries and see what they are doing. Have a look at BLAS for example. And I think that Matlab uses MAGMA.


If your contention is that you don't think O(n^2.8) is that much faster than O(n^3) this chart shows you that n doesn't need to be very large before that difference becomes significant.

enter image description here


It's very important to stop at the right moment.

With 1,000 x 1,000 matrices, you can multiply them by doing seven 500 x 500 products plus a few additions. That's probably useful. With 500 x 500, maybe. With 10 x 10 matrices, most likely not. You'd just have to do some experiments first at what point to stop.

But Strassen's algorithm only saves a factor 2 (at best) when the number of rows grows by a factor 32, the number of coefficients grows by 1,024, and the total time grows by a factor 16,807 instead of 32,768. In practice, that's a "constant factor". I'd say you gain more by transposing the second matrix first so you can multiply rows by rows, then look carefully at cache sizes, vectorise as much as possible, and distribute over multiple cores that don't step on each others' feet.