How fast can we *really* multiply matrices?

There are currently no practical implications of any fast matrix multiplication algorithms besides Strassen's. The Coppersmith–Winograd algorithm and its descendants (Stothers, Williams) are very complex and rely on probabilistic constructions, among other difficulties. There is no theoretical obstacle to implementing them in the sense you're asking about, and it is humanly possible, but there is little point in doing so, and I don't believe anyone has ever actually done it. It would be complicated and painful, and the only purpose would be to really learn how the algorithm works: the cross-over point at which it would overtake the naive cubic-time algorithm is so enormous that you would never actually see any improvement. There are other algorithms that would be somewhat easier to implement, at the cost of worse asymptotic performance, but they are also utterly impractical.
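
To see where the cross-over issue shows up even in the Strassen case, here is a minimal NumPy sketch of Strassen's recursion (my own illustrative code, not taken from any implementation discussed here). It assumes power-of-two dimensions and uses a hypothetical cutoff of 128, below which it falls back to the classical product:

```python
# Minimal sketch of Strassen's algorithm with a fallback cutoff.
# The cutoff value and the power-of-two assumption are illustrative
# choices, not tuned recommendations.
import numpy as np

CUTOFF = 128  # illustrative; real cross-over points are machine-dependent


def strassen(A, B):
    """Multiply two square matrices of power-of-two size via Strassen's recursion."""
    n = A.shape[0]
    if n <= CUTOFF:
        return A @ B  # fall back to the (highly optimized) classical product

    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    # Seven recursive products instead of eight.
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)

    C = np.empty((n, n), dtype=A.dtype)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C


if __name__ == "__main__":
    n = 512  # power of two, so no padding is needed in this sketch
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
    assert np.allclose(strassen(A, B), A @ B)
```

Practical Strassen implementations typically make the cutoff explicit like this, because below some machine-dependent threshold the vendor-tuned classical kernel wins; choosing that threshold and managing the extra memory traffic is most of the engineering work.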

There's also a deeper issue if you try to use algebraic algorithms in practice. The algebraic complexity model typically used for these problems counts only arithmetic operations and treats memory access as free. This made sense decades ago, when a single floating-point operation was comparatively expensive, but nowadays data movement is often the real bottleneck in practice. Algebraic complexity is beautiful and theoretically important, but it ignores important practical issues.
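
To make that blind spot concrete with the standard operation counts: the classical algorithm for $n \times n$ matrices is charged

$$n^3 \text{ multiplications} + (n^3 - n^2) \text{ additions} = 2n^3 - n^2 \text{ flops},$$

and that is the entire cost in the algebraic model; the $3n^2$ matrix entries that must be read and written, and all intermediate traffic through the memory hierarchy, are charged nothing.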

If you want to do fast matrix multiplication in practice, it will presumably be on a parallel computer. That introduces further issues of communication complexity; see http://arxiv.org/abs/1202.3173 for an analysis of the Strassen case, both in theory and in practice.


There is a recent PhD thesis about practical fast matrix multiplication algorithms like Strassen's:

Matrix multiplication is a core building block for numerous scientific computing and, more recently, machine learning applications. Strassen's algorithm, the original Fast Matrix Multiplication (FMM) algorithm, has long fascinated computer scientists due to its startling property of reducing the number of computations required for multiplying $n \times n$ matrices from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^{2.807})$. Over the last half century, this has fueled many theoretical improvements, such as other variations of Strassen-like FMM algorithms. Previous implementations of these FMM algorithms led to the "street wisdom" that they are only practical for large, relatively square matrices, that they require considerable workspace, and that thread-level parallelism is difficult to achieve with them. The thesis of this work dispels these notions by demonstrating significant benefits for small and non-square matrices, by requiring no workspace beyond what is already incorporated in high-performance implementations of matrix multiplication, and by achieving performance benefits on multi-core, many-core, and distributed-memory architectures.
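
For reference, the $\mathcal{O}(n^{2.807})$ exponent quoted above is exactly $\log_2 7$: Strassen's scheme replaces the eight half-size products of the classical blocked algorithm with seven, at the cost of $\mathcal{O}(n^2)$ extra additions per level, so

$$T(n) = 7\,T(n/2) + \mathcal{O}(n^2) \;\Longrightarrow\; T(n) = \mathcal{O}\bigl(n^{\log_2 7}\bigr) \approx \mathcal{O}\bigl(n^{2.807}\bigr),$$

whereas the classical recurrence $T(n) = 8\,T(n/2) + \mathcal{O}(n^2)$ gives $\mathcal{O}(n^3)$.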

This work includes several publications:

Strassen's Algorithm Reloaded. In The International Conference for High Performance Computing, Networking, Storage and Analysis (SC16), Salt Lake City, UT, November 2016.

Generating Families of Practical Fast Matrix Multiplication Algorithms. In 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS17), Orlando, FL, May 29-June 2, 2017.

Strassen's Algorithm for Tensor Contraction. In SIAM Journal on Scientific Computing (SISC), 40(3):C305-C326, 2018.

Implementing Strassen’s Algorithm with CUTLASS on NVIDIA Volta GPUs. FLAME Working Note #88, The University of Texas at Austin, Department of Computer Science. Technical Report TR-18-08. August 23, 2018.

The open source code repositories are here:

https://github.com/flame/fmm-gen

https://github.com/flame/tblis-strassen

Other than that, you might also be interested in this paper:

A framework for practical parallel fast matrix multiplication. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2015).