How does BLAS get such extreme performance?

So first of all BLAS is just an interface of about 50 functions. There are many competing implementations of the interface.

Firstly I will mention things that are largely unrelated:

Fortran vs C, makes no difference
Advanced matrix algorithms such as Strassen, implementations dont use them as they dont help in practice

Most implementations break each operation into small-dimension matrix or vector operations in the more or less obvious way. For example a large 1000x1000 matrix multiplication may broken into a sequence of 50x50 matrix multiplications.

These fixed-size small-dimension operations (called kernels) are hardcoded in CPU-specific assembly code using several CPU features of their target:

SIMD-style instructions
Instruction Level Parallelism
Cache-awareness

Furthermore these kernels can be executed in parallel with respect to each other using multiple threads (CPU cores), in the typical map-reduce design pattern.

Take a look at ATLAS which is the most commonly used open source BLAS implementation. It has many different competing kernels, and during the ATLAS library build process it runs a competition among them (some are even parameterized, so the same kernel can have different settings). It tries different configurations and then selects the best for the particular target system.

(Tip: That is why if you are using ATLAS you are better off building and tuning the library by hand for your particular machine then using a prebuilt one.)

First, there are more efficient algorithms for matrix multiplication than the one you're using.

Second, your CPU can do much more than one instruction at a time.

Your CPU executes 3-4 instructions per cycle, and if the SIMD units are used, each instruction processes 4 floats or 2 doubles. (of course this figure isn't accurate either, as the CPU can typically only process one SIMD instruction per cycle)

Third, your code is far from optimal:

You're using raw pointers, which means that the compiler has to assume they may alias. There are compiler-specific keywords or flags you can specify to tell the compiler that they don't alias. Alternatively, you should use other types than raw pointers, which take care of the problem.
You're thrashing the cache by performing a naive traversal of each row/column of the input matrices. You can use blocking to perform as much work as possible on a smaller block of the matrix, which fits in the CPU cache, before moving on to the next block.
For purely numerical tasks, Fortran is pretty much unbeatable, and C++ takes a lot of coaxing to get up to a similar speed. It can be done, and there are a few libraries demonstrating it (typically using expression templates), but it's not trivial, and it doesn't just happen.

A good starting point is the great book The Science of Programming Matrix Computations by Robert A. van de Geijn and Enrique S. Quintana-Ortí. They provide a free download version.

BLAS is divided into three levels:

Level 1 defines a set of linear algebra functions that operate on vectors only. These functions benefit from vectorization (e.g. from using SSE).
Level 2 functions are matrix-vector operations, e.g. some matrix-vector product. These functions could be implemented in terms of Level1 functions. However, you can boost the performance of this functions if you can provide a dedicated implementation that makes use of some multiprocessor architecture with shared memory.
Level 3 functions are operations like the matrix-matrix product. Again you could implement them in terms of Level2 functions. But Level3 functions perform O(N^3) operations on O(N^2) data. So if your platform has a cache hierarchy then you can boost performance if you provide a dedicated implementation that is cache optimized/cache friendly. This is nicely described in the book. The main boost of Level3 functions comes from cache optimization. This boost significantly exceeds the second boost from parallelism and other hardware optimizations.

By the way, most (or even all) of the high performance BLAS implementations are NOT implemented in Fortran. ATLAS is implemented in C. GotoBLAS/OpenBLAS is implemented in C and its performance critical parts in Assembler. Only the reference implementation of BLAS is implemented in Fortran. However, all these BLAS implementations provide a Fortran interface such that it can be linked against LAPACK (LAPACK gains all its performance from BLAS).

Optimized compilers play a minor role in this respect (and for GotoBLAS/OpenBLAS the compiler does not matter at all).

IMHO no BLAS implementation uses algorithms like the Coppersmith–Winograd algorithm or the Strassen algorithm. The likely reasons are:

Maybe its not possible to provide a cache optimized implementation of these algorithms (i.e. you would loose more then you would win)
These algorithms are numerically not stable. As BLAS is the computational kernel of LAPACK this is a no-go.
Although these algorithms have a nice time complexity on paper, the Big O notation hides a large constant, so it only starts to become viable for extremely large matrices.

Edit/Update:

The new and ground breaking paper for this topic are the BLIS papers. They are exceptionally well written. For my lecture "Software Basics for High Performance Computing" I implemented the matrix-matrix product following their paper. Actually I implemented several variants of the matrix-matrix product. The simplest variants is entirely written in plain C and has less than 450 lines of code. All the other variants merely optimize the loops

    for (l=0; l<MR*NR; ++l) {
        AB[l] = 0;
    }
    for (l=0; l<kc; ++l) {
        for (j=0; j<NR; ++j) {
            for (i=0; i<MR; ++i) {
                AB[i+j*MR] += A[i]*B[j];
            }
        }
        A += MR;
        B += NR;
    }

The overall performance of the matrix-matrix product only depends on these loops. About 99.9% of the time is spent here. In the other variants I used intrinsics and assembler code to improve the performance. You can see the tutorial going through all the variants here:

ulmBLAS: Tutorial on GEMM (Matrix-Matrix Product)

Together with the BLIS papers it becomes fairly easy to understand how libraries like Intel MKL can gain such a performance. And why it does not matter whether you use row or column major storage!

The final benchmarks are here (we called our project ulmBLAS):

Benchmarks for ulmBLAS, BLIS, MKL, openBLAS and Eigen

Another Edit/Update:

I also wrote some tutorial on how BLAS gets used for numerical linear algebra problems like solving a system of linear equations:

High Performance LU Factorization

(This LU factorization is for example used by Matlab for solving a system of linear equations.)

~~I hope to find time~~ to extend the tutorial to describe and demonstrate how to realise a highly scalable parallel implementation of the LU factorization like in PLASMA.

Ok, here you go: Coding a Cache Optimized Parallel LU Factorization

P.S.: I also did make some experiments on improving the performance of uBLAS. It actually is pretty simple to boost (yeah, play on words :) ) the performance of uBLAS:

Experiments on uBLAS.

Here a similar project with BLAZE:

Experiments on BLAZE.

I don't know specfically about BLAS implementation but there are more efficient alogorithms for Matrix Multiplication that has better than O(n3) complexity. A well know one is Strassen Algorithm

How does BLAS get such extreme performance?

Tags:

C++

Fortran

Related

Recent Posts