How to speed up the auxiliary DoolittleDecomposite function?
It's pretty clear that the complexity of this function is $\text{O}(n^3)$, since we're iterating over each element of a matrix (a factor of $n^2$) and taking a sum at each one (an additional factor of $n$). However, we can't really do anything to reduce the complexity; LU decomposition has the same complexity as matrix multiplication. Although algorithms faster than $\text{O}(n^3)$ do exist, I do not think they are used in practice.
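To make the $\text{O}(n^3)$ claim concrete, we can count the multiply/divide operations the Doolittle loops perform. A short Python sketch (an illustration, not part of the original code; `flop_count` is my own name) shows the count approaching $n^3/3$:

```python
def flop_count(n):
    """Count the multiply/divide operations in the Doolittle loops
    (0-based indices; the inner sums run over k < i)."""
    count = 0
    for i in range(n):
        for j in range(i, n):       # row i of U: i multiplications each
            count += i
        for j in range(i + 1, n):   # column i of L: i mults plus 1 division
            count += i + 1
    return count

# The leading term is n^3/3, so the loops are firmly O(n^3).
print(flop_count(200) / 200**3)  # close to 1/3
```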
The upshot of all this is that if we really want to increase our speed, we should Compile our function. There is an in-place version of the algorithm (originally written in Fortran); that is, it doesn't use any extra memory (although we do need to copy the matrix once so that we can 'edit' it). We can write a CompiledFunction to implement this version of the algorithm:
doolittleDecomposition = Compile[{{a, _Real, 2}},
  With[{n = Length[a]},
   Module[{t = a},
    Do[
     Do[
      Do[
       t[[i, j]] -= t[[i, k]] t[[k, j]],
       {k, 1, i - 1}],
      {j, i, n}];
     Do[
      Do[
       t[[j, i]] -= t[[j, k]] t[[k, i]],
       {k, 1, i - 1}];
      t[[j, i]] /= t[[i, i]],
      {j, i + 1, n}],
     {i, 1, n}];
    t
   ]
  ],
  {{q, _Real}},
  RuntimeOptions -> {"Speed", "EvaluateSymbolically" -> False},
  CompilationTarget -> "C",
  CompilationOptions -> {"ExpressionOptimization" -> True}]
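For readers more comfortable outside the Wolfram Language, here is a line-by-line translation of the same in-place algorithm into Python/NumPy (the name `doolittle_py` is mine, purely for illustration; this is a sketch, not the original code):

```python
import numpy as np

def doolittle_py(a):
    """Loop-for-loop translation of the compiled Doolittle code:
    U ends up on and above the diagonal; the strictly lower part
    holds the factors of the unit-diagonal L."""
    t = np.array(a, dtype=float)  # copy once, like Module[{t = a}, ...]
    n = len(t)
    for i in range(n):
        for j in range(i, n):          # row i of U
            for k in range(i):
                t[i, j] -= t[i, k] * t[k, j]
        for j in range(i + 1, n):      # column i of L
            for k in range(i):
                t[j, i] -= t[j, k] * t[k, i]
            t[j, i] /= t[i, i]
    return t
```

The combined result can be split afterwards with `np.tril(t, -1) + np.eye(n)` for L and `np.triu(t)` for U.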
We should double-check that our function doesn't call MainEvaluate, which is slow:
Needs["CompiledFunctionTools`"]
doolittleDecomposition // CompilePrint
While doing this check I discovered that I originally wrote t[j, i] instead of t[[j, i]], causing MainEvaluate to be called. Always double-check! In fact, let's now double-check that both algorithms give the same result:
testReal = RandomReal[1, {200, 200}];
d1 = doolittleDecomposite[testReal]; // AbsoluteTiming
d2 = doolittleDecomposition[testReal]; // AbsoluteTiming
d1 == d2
This also shows that the Compiled version is faster.
Update: Refactoring the original code
If you think Compile is too drastic (or you don't have a working C compiler), there are some other changes we can make.
- Don't use an auxiliary or 'helper' function. This will incur extra overhead that you don't need. Unlike a compiled language, where the compiler will essentially inline the function, Mathematica will be passing all of your matrices by value!
- Don't use Nest. If we're getting rid of the helper function, Nest doesn't really make sense. We can replace it with a Do loop instead. The Do loop can 'automatically' increment k, instead of having to pass it (in a list!) to a function and have it returned (also in a list!!). At this point we can rename intermediate to temp, consolidating two variables that really contain the same thing, and we can also lose the extra redefinition of row inside the loop.
- Use Dot instead of Sum. In general, if there's a built-in function for something, that function is faster. In this case you want to pairwise multiply two lists and take the sum of the resulting elements. We usually call that a dot product. Instead of looping over m, we can use Span to extract a list using Part, and then Dot our two lists together. This is where most of the speedup comes from.
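The Dot-for-Sum substitution in the last point is easy to sanity-check in any language. In NumPy terms (an illustration, not the original code), the explicit loop and the single built-in dot product agree exactly:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Sum[x[[k]] y[[k]], {k, Length[x]}] spelled out as a loop...
loop_sum = sum(x[k] * y[k] for k in range(len(x)))
# ...versus the built-in dot product, i.e. x.y
dot = x @ y

print(loop_sum, dot)  # both 32.0
```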
At this point the code looks pretty much like the Fortran function, but with the inner loops handled by Dot:
doolittleDecomposite2[mat_?MatrixQ] :=
Module[{temp = ConstantArray[0, Dimensions@mat], row = Length@mat},
Do[
Do[temp[[k, j]] =
mat[[k, j]] - temp[[k, ;; k - 1]].temp[[;; k - 1, j]],
{j, k, row}];
Do[temp[[i, k]] =
(mat[[i, k]] - temp[[i, ;; k - 1]].temp[[;; k - 1, k]]) / temp[[k, k]],
{i, k + 1, row}];,
{k, row}];
temp
]
Updated performance analysis
I ran another comparison, this time including LUDecomposition, just to see how well we did:
The first thing I noticed (even before I generated the plots) was that the original solution takes an unreasonably long time on matrices larger than about $n=250$ (I tried $n=400$ and let it run for at least an hour before aborting manually). I suspect this has to do with memory bandwidth when passing temp into Sum.
The second thing to notice is that LUDecomposition is fast. Not only is it at least two orders of magnitude faster than mine, but below $n=1000$ it is almost instant (less than 10 ms). The former is probably from using highly optimized libraries like the Intel MKL. As for the latter, it's probably not a coincidence that the 'fast' runs all have memory usage below the size of my CPU cache (8 MB):
The third thing to notice is that while all of our solutions are firmly $\text{O}(n^3)$ for large $n$, the native LUDecomposition is actually closer to $\text{O}(n^{2.4})$, meaning that there is probably a large $\text{O}(n^2)$ term that the leading $n^3$ won't dominate until very large $n$.
Similarly, although the refactored version appears to be slightly better than $\text{O}(n^3)$, this is only because of all the overhead; as the vectors passed to Dot get larger, the $n^3$ term will eventually dominate.
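Empirical exponents like the $n^{2.4}$ above come from the usual log-log trick: fit a straight line to $\log t$ versus $\log n$ and read off the slope. A minimal sketch, with synthetic cubic timings standing in for the real measurements:

```python
import numpy as np

# Synthetic timings t = c * n^3 stand in for real measurements here.
ns = np.array([100.0, 200.0, 400.0, 800.0])
ts = 1e-9 * ns**3

# The slope of log t vs log n estimates the scaling exponent.
slope, intercept = np.polyfit(np.log(ns), np.log(ts), 1)
print(round(slope, 3))  # 3.0 for cubic data
```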
Looking back at the memory usage plot, we can see that the original function doesn't use much more memory than mine. All are asymptotically $\text{O}(n^2)$. The reason the original takes so long is that Mathematica is essentially pass-by-value, so it performs a huge malloc and free every time it calls the helper function.
The refactored version uses almost exactly the same amount of memory as my Compiled solution (excepting a small $\text{O}(n)$ contribution from the vectors and Dot) because we avoid all that argument-passing and copying.
As to why I'm using almost exactly 3× more memory than LUDecomposition, it's probably also due to pass-by-value semantics: Mathematica constructs a version of the matrix to pass to the CompiledFunction, which then makes its own working copy, which is then copied again to produce the result. As LUDecomposition is a native function, it can avoid all that.
The function doolittleDecomposite2 refactored by @2012rcampion can use Span (;;) to avoid the inner Do loop:
doolittle[mat_?MatrixQ] :=
Module[
{temp = ConstantArray[1, Dimensions@mat], row = Length@mat},
Do[
temp[[k, k ;; row]] =
mat[[k, k ;; row]] - temp[[k, ;; k - 1]].temp[[;; k - 1, k ;; row]];
temp[[k + 1 ;; row, k]] =
(mat[[k + 1 ;; row, k]] -
temp[[k + 1 ;; row, ;; k - 1]].temp[[;; k - 1, k]])/temp[[k, k]];,
{k, row}];
temp
]
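The same whole-row/whole-column update pattern carries over directly to NumPy slices, which may help for comparison. This sketch (the name `doolittle_vec` is mine) mirrors the Span version above:

```python
import numpy as np

def doolittle_vec(a):
    """Slice-at-a-time Doolittle, mirroring the Span (;;) code: each
    pass updates a whole row of U and a whole column of L with one
    dot product. Empty slices at k = 0 contribute zero, as in the
    Mathematica original."""
    mat = np.asarray(a, dtype=float)
    n = len(mat)
    t = np.zeros_like(mat)
    for k in range(n):
        t[k, k:] = mat[k, k:] - t[k, :k] @ t[:k, k:]
        t[k+1:, k] = (mat[k+1:, k] - t[k+1:, :k] @ t[:k, k]) / t[k, k]
    return t
```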
Or
LU[mat_] :=
Module[{n, L, U},
n = Length@mat;
L = IdentityMatrix[n];
U = mat;
Do[
L[[k ;; n, k]] = U[[k ;; n, k]]/U[[k, k]];
U[[(k + 1) ;; n, k ;; n]] =
U[[(k + 1) ;; n, k ;; n]] - L[[(k + 1) ;; n, {k}]].U[[{k}, k ;; n]];
, {k, 1, n - 1}];
{L, U}
]
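This variant is the classic outer-product (rank-1 update) formulation of Gaussian elimination. A NumPy rendering of the same LU[] logic, for illustration (`lu_outer` is my name; unpivoted, so it assumes nonzero pivots):

```python
import numpy as np

def lu_outer(a):
    """Unpivoted LU via rank-1 updates, mirroring LU[] above:
    returns unit-lower-triangular L and upper-triangular U."""
    U = np.array(a, dtype=float)
    n = len(U)
    L = np.eye(n)
    for k in range(n - 1):
        # L[[k ;; n, k]] = U[[k ;; n, k]] / U[[k, k]]
        L[k:, k] = U[k:, k] / U[k, k]
        # Rank-1 update zeroes out column k of U below the diagonal.
        U[k+1:, k:] -= np.outer(L[k+1:, k], U[k, k:])
    return L, U
```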
(*=========================*)
mat = RandomInteger[10, {5, 5}]
LU@mat
$ \left( \begin{array}{ccccc} 1 & 0 & 0 & 0 & 0 \\ \frac{4}{7} & 1 & 0 & 0 & 0 \\ 1 & -\frac{14}{5} & 1 & 0 & 0 \\ 1 & -\frac{49}{20} & \frac{219}{176} & 1 & 0 \\ \frac{5}{7} & -\frac{1}{2} & \frac{5}{88} & -\frac{270}{1303} & 1 \\ \end{array} \right),\left( \begin{array}{ccccc} 7 & 9 & 6 & 0 & 8 \\ 0 & \frac{20}{7} & -\frac{17}{7} & 10 & -\frac{25}{7} \\ 0 & 0 & -\frac{44}{5} & 34 & -14 \\ 0 & 0 & 0 & -\frac{1303}{88} & \frac{939}{88} \\ 0 & 0 & 0 & 0 & -\frac{4552}{1303} \\ \end{array} \right)$