How to compile effectively?
I'll just throw in a few random thoughts in no particular order, but this will be a rather high-level view on things. This is necessarily a subjective exposition, so treat it as such.
Typical use cases
In my opinion, Compile as an efficiency-boosting device is effective in two kinds of situations (and their mixes):
- The problem is solved most efficiently with a procedural style, because, for example, an efficient algorithm for it is formulated procedurally and does not have a simple / efficient functional counterpart (note also that functional programming in Mathematica is peculiar in many respects, reflecting the fact that the functional layer is a thin one on top of the rule-based engine; so some algorithms which are efficient in other functional languages may be inefficient in Mathematica). A very clear sign of this is when you have to do array indexing in a loop.
- The problem can be solved by joining several Compile-able built-in functions together, but there are (perhaps several) "joints" where you face a performance hit if using top-level code, because it stays general and can not use specialized versions of these functions, and for a few other reasons. In such cases, Compile merely makes the code more efficient by effectively type-specializing to numerical arguments and not using the main evaluator. One example that comes to mind is when we compile Select with a custom (compilable) predicate and get a substantial performance boost (here is one example; see also the sketch below).
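As a minimal illustration of the second point (a sketch of my own; the function name and threshold are made up), compiling Select with a compilable predicate keeps the whole operation inside the compiled evaluator:

selAbove = Compile[{{lst, _Real, 1}, {threshold, _Real}},
   Select[lst, # > threshold &]];
selAbove[RandomReal[1, 10^6], 0.9] // Length // AbsoluteTiming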
I use this rule of thumb when determining whether or not I will benefit from Compile: the more my code inside Compile looks like C code I'd write otherwise, the more I benefit from it (strictly speaking, this is only true for compilation to C, not MVM).
It may happen that some portions of top-level code are the major bottleneck and can not be recast into a more efficient form, for a given approach to the problem. In such a case, Compile may not really help, and it is best to rethink the whole approach and try to find another formulation for the problem. In other words, it often saves time and effort to do some profiling first and get a good idea of where the real bottlenecks are, before turning to Compile.
Limitations of Compile
Here is an (incomplete) list of limitations, most of which you mentioned yourself:
- Can only accept regular arrays (tensors) of numerical or boolean types. This excludes ragged arrays and more general Mathematica expressions.
- In most cases, can only return a single tensor of some type
- Only machine-precision arithmetic
- Of the user-defined functions, only pure functions are compilable, plus one can inline other compiled functions. Rules and "functions" defined with rules are inherently not compilable.
- No way to create functions with memory (a-la static variables in C)
- Only a small subset of built-in functions can be compiled to byte-code (or C)
- Possibilities for writing recursive compiled functions seem to be very limited, and most interesting cases seem to be ruled out
- No decent pass-by-reference semantics, which is a big deal (to me anyways)
- You can not really use indexed variables in Compile, although it may appear that you can.
- ...
Whether or not to compile to C?
I think this depends on the circumstances. Compilation to C is expensive (the compilation itself takes noticeably longer), so it makes sense only for performance-critical code that will be used many times. There are also many cases when compilation to the MVM target gives similar run-time performance while compiling much faster. One such example can be found in this answer, where just-in-time compilation to the MVM target led to a major speed-up, while compilation to C would likely have defeated its purpose - in that particular case.
Another class of situations where compiling to C may not be the best option is when you want to "serialize" the CompiledFunction object and distribute it to others, for example in a package, and you don't want to count on a C compiler being installed on the user's machine. As far as I know, there is no automatic mechanism yet to grab the generated shared library and package it together with the CompiledFunction; one would also have to cross-compile for all platforms and automatically dispatch to the right library to load. All this is possible but complicated, so, unless the speed gain can justify such complications for a given problem, it may not be worth it, while compilation to the MVM target creates a top-level CompiledFunction object which is automatically cross-platform and does not require anything (except Mathematica) to be installed.
So, it really depends, although more often than not compilation to C will lead to faster execution and, if you decide to use Compile at all, will be justified.
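For completeness, here is how the two targets are selected (a trivial sketch of my own; the "C" target requires a C compiler to be installed):

cfMVM = Compile[{{x, _Real}}, Sin[x] + x^2];  (* default target: MVM byte code *)
cfC = Compile[{{x, _Real}}, Sin[x] + x^2, CompilationTarget -> "C"];  (* generates, compiles and loads C code *)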
What to include in Compile
I share the opinion that, unless you have some specific requirements, it is best to only use Compile on minimal code fragments which would benefit from it the most, rather than have one big Compile. This is good because:
- It allows you to better understand where the real bottlenecks are
- It makes your compiled code more testable and composable
- If you really need it, you can then combine these pieces and use the "InlineCompiledFunctions" -> True option setting to get all the benefits that one large Compile would give you (a sketch follows after these lists)
- Since Compile is limited in what it can take, you will have fewer headaches about how to include some uncompilable pieces, plus fewer chances to overlook a callback to the main evaluator
That said, you may benefit from one large Compile in some situations, including:
- Cases when you want to grab the resulting C code and use it stand-alone (linked against Wolfram RTL)
- Cases when you want to run your compiled code in parallel on several kernels and don't want to think about possible definitions distribution issues etc (this was noted by @halirutan)
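To sketch the inlining option mentioned in the first list (the small functions here are invented for the illustration; as far as I know, "InlineExternalDefinitions" is also needed so that the symbol square gets resolved at compile time):

square = Compile[{{x, _Real}}, x^2];
sumOfSquares = Compile[{{a, _Real}, {b, _Real}},
   square[a] + square[b],
   CompilationOptions -> {"InlineExternalDefinitions" -> True,
     "InlineCompiledFunctions" -> True}];  (* the body of square is inlined, no call through the evaluator *)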
Listable functions
When you can, it may be a good idea to use the RuntimeAttributes -> Listable option, so that your code can be executed on (all or some) available cores in parallel. I will give one example which I think is rather interesting, because it represents a problem which may not initially look like one directly amenable to this (although it is surely not at all hard to realize that parallelization may work here) - computation of Pi as a partial sum of a well-known infinite series. Here is a single-core function:
Clear[numpi1];
numpi1 =
Compile[{{nterms, _Integer}},
4*Sum[(-1)^k/(2 k + 1), {k, 0, nterms}],
CompilationTarget -> "C", RuntimeOptions -> "Speed"];
Here is a parallel version:
numpiParallelC =
Compile[{{start, _Integer}, {end, _Integer}},
4*Sum[(-1)^k/(2 k + 1), {k, start, end}], CompilationTarget -> "C",
RuntimeAttributes -> Listable, RuntimeOptions -> "Speed"];
Clear[numpiParallel];
numpiParallel[nterms_, nkernels_] :=
Total@Apply[numpiParallelC,
MapAt[# + 1 &, {Most[#], Rest[#] - 1}, {2, -1}] &@
IntegerPart[Range[0, nterms, nterms/nkernels]]];
Now, some benchmarks (on a 6-core machine):
(res0=numpiParallel[10000000,1])//AbsoluteTiming
(res1=numpiParallel[10000000,6])//AbsoluteTiming
(res2=numpi1[10000000])//AbsoluteTiming
Chop[{res0-res2,res0-res1,res2-res1}]
(*
==>
{0.0722656,3.14159}
{0.0175781,3.14159}
{0.0566406,3.14159}
{0,0,0}
*)
A few points to note here:
- It may happen that the time it takes to prepare the data to be fed into a Listable compiled function is much more than the time the function runs (e.g. when we use Transpose or Partition etc. on huge lists), which then sort of defeats the purpose. So, it is good to estimate whether or not that will be the case.
- A more "coarse-grained" alternative to this is to run a single-threaded compiled function in parallel on several Mathematica kernels, using the built-in parallel functionality (ParallelEvaluate, ParallelMap, etc.), as sketched below. These two possibilities are useful in different situations.
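Here is a minimal sketch of that coarse-grained alternative (the compiled function is made up for the illustration and compiled to the default MVM target, so that nothing beyond the definition needs to be distributed to the subkernels):

cf = Compile[{{n, _Integer}}, Sum[Sin[1. i], {i, n}]];  (* single-threaded compiled function *)
LaunchKernels[];                                        (* if not already launched *)
DistributeDefinitions[cf];                              (* make cf available on the subkernels *)
ParallelMap[cf, {10^6, 2 10^6, 3 10^6, 4 10^6}] // AbsoluteTiming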
Auto-compilation
While this is not directly related to the explicit use of Compile, this topic logically belongs here. There are a number of built-in (higher-order) functions, such as Map, which can auto-compile. What this means is that when we execute
Map[f, list]
the function f is analyzed by Map, which attempts to automatically call Compile on it (this is not done at the top level, so using Trace won't show an explicit call to Compile). To benefit from this, the function f must be compilable. As a rule of thumb, it has to be a pure function for that (which is not by itself a sufficient condition) - and generally the question of whether or not a function is compilable is answered here in the same way as for explicit Compile. In particular, functions defined by patterns will not benefit from auto-compilation, which is something to keep in mind.
Here is a little contrived but simple example to illustrate the point:
sumThousandNumbers[n_] :=
Module[{sum = 0}, Do[sum += i, {i, n, n + 1000}]; sum]
sumThousandNumbersPF =
Module[{sum = 0}, Do[sum += i, {i, #, # + 1000}]; sum] &
Now, we try:
Map[sumThousandNumbers, Range[3000]]//Short//Timing
Map[sumThousandNumbersPF, Range[3000]]//Short//Timing
(*
==> {3.797,{501501,502502,503503,504504,505505,<<2990>>,3499496,
3500497,3501498,3502499,3503500}}
{0.094,{501501,502502,503503,504504,505505,<<2990>>,3499496,
3500497,3501498,3502499,3503500}}
*)
which shows a 40-times speedup in this particular case, due to auto-compilation.
There are in fact many cases when this is important, and not all of them are as obvious as the above example. One such case was considered in a recent answer to the question of extracting numbers from a sorted list belonging to some window. The solution is short and I will reproduce it here:
window[list_, {xmin_, xmax_}] :=
Pick[list, Boole[xmin <= # <= xmax] & /@ list, 1]
What may look like a not particularly efficient solution is actually quite fast, due to the auto-compilation of the predicate Boole[...] inside Map, plus Pick being optimized on packed arrays. See the aforementioned question for more context and discussion.
This shows us another benefit of auto-compilation: not only does it often make the code run much faster, but it also does not unpack, allowing surrounding functions to also benefit from packed arrays when they can.
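One can check this directly (a small sketch of my own, using Developer`PackedArrayQ):

lst = RandomReal[1, 10^6];
Developer`PackedArrayQ[Boole[0.2 <= # <= 0.6] & /@ lst]
(* True - the result of the auto-compiled Map stays packed *)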
Which functions can auto-compile? One way to find out is to inspect SystemOptions["CompileOptions"]:
Cases["CompileOptions"/.SystemOptions["CompileOptions"],
opt:(s_String->_)/;StringMatchQ[s,__~~"Length"]]
{"ApplyCompileLength" -> \[Infinity], "ArrayCompileLength" -> 250,
"FoldCompileLength" -> 100, "ListableFunctionCompileLength" -> 250,
"MapCompileLength" -> 100, "NestCompileLength" -> 100,
"ProductCompileLength" -> 250, "SumCompileLength" -> 250,
"TableCompileLength" -> 250}
This also tells you the threshold list lengths beyond which auto-compilation is turned on. You can also change these values. Setting the value of a ...CompileLength option to Infinity effectively disables the corresponding auto-compilation. You can see that "ApplyCompileLength" has this value by default. This is because Apply can only auto-compile three heads: Times, Plus, and List. If you have one of those in your code, however, you can reset this value to benefit from auto-compilation. Generally, the default values are pretty reasonable, so it is rarely necessary to change them.
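For example, one could lower the threshold for Map like this (the value 50 is arbitrary, just for illustration; the setting holds for the rest of the session):

SetSystemOptions["CompileOptions" -> "MapCompileLength" -> 50];   (* auto-compile Map for lists of length >= 50 *)
SetSystemOptions["CompileOptions" -> "MapCompileLength" -> 100];  (* restore the default *)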
A few more techniques
There are a number of techniques involving Compile which are perhaps somewhat more advanced, but which sometimes allow one to solve problems for which plain Compile is not flexible enough. Some which I am aware of:
- Sometimes you can trade memory for speed and, having a nested ragged list, pad it with zeros to form a tensor, and pass that to Compile.
- Sometimes your list is general and you can not directly process it in Compile to do what you want; however, you can reformulate the problem such that you instead process a list of element positions, which are integers. I call it "element-position duality". One example of this technique in action is here; for a larger application of this idea see my last post in this thread (I hesitated to include this reference because my first several posts there are incorrect solutions. Note that for that particular problem, a far more elegant and short, but somewhat less efficient, solution was given at the end of that thread).
- Sometimes you may need some structural operations to prepare the input data for Compile, and the data contains lists (or, generally, tensors) of different types (say, integer positions and real values). To keep the list packed, it may make sense to convert the integers to reals (in this example), converting them back to integers with IntegerPart inside Compile. One such example is here.
- Run-time generation of compiled functions, where certain run-time parameters get embedded. This may be combined with memoization. One example is here; another very good example is here.
- One can emulate pass-by-reference and have a way of composing larger compiled functions out of smaller ones with parameters (well, sort of), without a loss of efficiency. This technique is showcased for example here.
- A common wisdom is that, since neither linked lists nor Sow-Reap are compilable, one has to pre-allocate large arrays most of the time to store intermediate results. There are at least two other options:
  - Use Internal`Bag, which is compilable (the problem, however, is that it can not be returned as a result of Compile as of now, AFAIK); a small sketch is given at the end of this list of techniques.
  - It is quite easy to implement an analog of a dynamic array inside your compiled code, by setting up a variable which gives the current size limit, and copying your array to a new larger array once more space is needed. In this way, you only allocate (at the end) as much space as is really needed, for the price of some overhead, which is often negligible.
- One may often be able to use vectorized operations like UnitStep, Clip, Unitize etc. to replace if-else control flow in inner loops, also inside Compile. This may give a huge speed-up, particularly when compiling to the MVM target. Some examples are in my comments in this and this blog posts, and one other pretty illustrative example of a vectorized binary search is in my answer in this thread.
- Using an additional list of integers as "pointers" to some lists you may have. Here, I will make an exception for this post and give an explicit example illustrating the point. The following is a fairly efficient function to find a longest increasing subsequence of a list of numbers. It was developed jointly by DrMajorBob, Fred Simons and myself, in an on- and off-line MathGroup discussion (so this final form is not available publicly AFAIK, thus including it here).
Here is the code
Clear[btComp];
btComp =
Compile[{{lst, _Integer, 1}},
Module[{refs, result, endrefs = {1}, ends = {First@lst},
len = Length@lst, n0 = 1, n1 = 1, i = 1, n, e},
refs = result = 0 lst;
For[i = 2, i <= len, i++,
Which[
lst[[i]] < First@ends,
(ends[[1]] = lst[[i]]; endrefs[[1]] = i; refs[[i]] = 0),
lst[[i]] > Last@ends,
(refs[[i]] = Last@endrefs;AppendTo[ends, lst[[i]]]; AppendTo[endrefs, i]),
First@ends < lst[[i]] < Last@ends,
(n0 = 1; n1 = Length@ends;
While[n1 - n0 > 1,
n = Floor[(n0 + n1)/2];
If[ends[[n]] < lst[[i]], n0 = n, n1 = n]];
ends[[n1]] = lst[[i]];
endrefs[[n1]] = i;
refs[[i]] = endrefs[[n1 - 1]])
]];
For[i = 1; e = Last@endrefs, e != 0, (i++; e = refs[[e]]),
result[[i]] = lst[[e]]];
Reverse@Take[result, i - 1]], CompilationTarget -> "C"];
Here is an example of use (list should not contain duplicates):
test = RandomSample[#, Length[#]] &@ Union@RandomInteger[{1, 1000000}, 1000000];
btComp[test] // Length // Timing
The fastest solution based on built-ins, which is indeed very fast, is still about 6 times slower for this size of the list:
LongestCommonSequence[test, Sort@test] // Short // Timing
Anyways, the point here is that this was possible because of the extra variables refs and endrefs, the use of which allowed us to manipulate only single integers (representing positions of sub-lists in a larger list) instead of large integer lists.
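To illustrate the Internal`Bag option mentioned in one of the points above, here is a minimal sketch of my own (Internal`Bag, Internal`StuffBag and Internal`BagPart are undocumented, so the usual caveats apply) for accumulating a result of unknown length inside Compile:

selectLarge = Compile[{{lst, _Real, 1}, {threshold, _Real}},
   Module[{bag = Internal`Bag[Most[{0.}]]},  (* an empty bag of reals *)
    Do[If[lst[[i]] > threshold, Internal`StuffBag[bag, lst[[i]]]], {i, Length[lst]}];
    Internal`BagPart[bag, All]               (* return the contents, since the bag itself can not be returned *)
    ]];
selectLarge[RandomReal[1, 10], 0.5]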
A few assorted remarks
Things to watch out for: see this discussion for some tips on that. Basically, you should avoid:
- Callbacks to the main evaluator
- Excessive copying of tensors (the CopyTensor instruction)
- Accidental unpacking happening in top-level functions preparing input for Compile or processing its output. This is not related to Compile proper, but it may happen that Compile does not help at all, because the bottleneck is in the top-level code (a sketch for detecting unpacking follows after these remarks).
- Type conversion: I would not worry about the performance hit, but sometimes wrong types may lead to run-time errors, or unanticipated callbacks to MainEvaluate in the compiled code.
- Certain functions (e.g. Sort with the default comparison function, but not only) don't benefit from compilation much or at all.
- It is not clear how Compile handles Hold-attributes in compiled code, but there are indications that it does not fully preserve the standard semantics we are used to in the top level.
- How to see whether or not you can effectively use Compile for a given problem: my experience is that with Compile in Mathematica you have to be "proactive" (with all my dislike for the word, I know of nothing better here). What I mean is that, to use it effectively, you have to search the structure of your problem / program for places where you could transform (parts of) the data into a form which can be used in Compile. In most cases (at least in my experience), except obvious ones where you already have a procedural algorithm in pseudo-code, you have to reformulate the problem, so you have to actively ask: what should I do to use Compile here?
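Regarding the accidental-unpacking point above, here is a small sketch of my own for detecting it: turning on the "Packing" messages makes every unpacking event visible:

On["Packing"];                             (* emit a message whenever a packed array gets unpacked *)
data = RandomReal[1, 10^6];
Developer`PackedArrayQ[Append[data, 1/2]]  (* False - appending an exact number unpacks, and is now reported *)
Off["Packing"];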
Evaluating SetSystemOptions["CompileOptions" -> "CompileReportExternal" -> True] will make Compile emit a message when parts of your function do not get compiled. After compilation, Needs["CompiledFunctionTools`"] followed by CompilePrint[cF] (with cF the function you have compiled) will display some bytecode; looking for CopyTensor or MainEvaluate instructions in that output helps locate inefficiencies (MainEvaluate calls back into the main Mathematica kernel, so it effectively means that part was not compiled).
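Here is a minimal sketch of that workflow (the function g below is deliberately left undefined, so the call to it can not be compiled):

Needs["CompiledFunctionTools`"]
cF = Compile[{{x, _Real}}, Sin[x] + g[x]];
CompilePrint[cF]  (* the printout contains a MainEvaluate instruction for the g[x] call *)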
Other useful options may be found by evaluating SystemOptions["CompileOptions"]
EDIT: As suggested by Oleksandr in a comment, On[Compile::noinfo] is also useful.
Leonid Shifrin has already given an excellent answer to the question, but it's so… long that it may be frustrating for someone just beginning to learn the usage of Compile, so I decided to post this as an answer.
Recently (OK… actually it's more than a year ago) I found that Ted Ersek's Mathematica tricks (a .nb version can be found here) contains a brief but masterly summary of the limitations of Compile. Though it's written based on Version 4, the 6 rules listed by him are still valid, and if you fully understand them, you can handle most compilation issues. For convenience, I'll paste the relevant part below (with modifications to part of the commentaries and examples).
BTW, based on the 6 rules, I wrote a Chinese tutorial for compilation here.
(1) Only a subset of kernel functions can be used in compiled evaluation.
We already have this post for the issue.
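If you want to inspect that set programmatically, the following undocumented call returns, as far as I know, the list of compilable functions (being undocumented, it may change between versions):

Compile`CompilerFunctions[] // Sort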
(2) Compiled evaluation can't use global variables.
In the next piece of code a compiled function uses a global variable width. Later we see that the function works, but timing tests will show that it's slower than if the function was defined without using Compile.
width=2.5;
g1 = Function[{x}, x + width];
g2 = Compile[{x}, x + width];
Do[g1[0.3], {10^6}]; // AbsoluteTiming
Do[g2[0.3], {10^6}]; // AbsoluteTiming
{0.786027, Null}
{1.303057, Null}
The next piece of code defines the same function as the previous example and takes the global variable as an argument. This is the way global variables should be used in a compiled function.
g3 = Compile[{x, y}, x + y];
Do[g3[0.3, width], {10^6}]; // AbsoluteTiming
{0.415907, Null}
The next piece of code shows what happens if we try to set the value of a global variable inside a compiled function. Here again the function works, but timing tests show that the function is slower than the same function defined without using Compile.
h1 = Function[{x}, (temp = 2.5; x + temp)];
h2 = Compile[{x}, (temp = 2.5; x + temp)];
Do[h1[0.3], {10^6}]; // AbsoluteTiming
Do[h2[0.3], {10^6}]; // AbsoluteTiming
{1.471703, Null}
{2.391085, Null}
In the last example the value of temp was set inside Compile. When something like this is needed, Block, Module, or With should be used to make temp a local variable. Compiled functions are defined below that do the same thing as the last example using Block, Module, and With, and they all use compiled evaluation, which runs much faster. Besides the advantage in speed, the approach below ensures that evaluating, say, ha[0.3], hb[0.3], hc[0.3] will not change the value temp might have outside the compiled function.
ha = Compile[{x}, Block[{temp = 2.5}, x + temp]];
hb = Compile[{x}, Module[{temp = 2.5}, x + temp]];
hc = Compile[{x}, With[{temp = 2.5}, x + temp]];
Do[ha[0.3], {10^6}]; // AbsoluteTiming
Do[hb[0.3], {10^6}]; // AbsoluteTiming
Do[hc[0.3], {10^6}]; // AbsoluteTiming
{0.364617, Null}
{0.352145, Null}
{0.347725, Null}
(3) Compiled evaluation can't work with patterns.
This rule can be viewed as a consequence of rule (1).
The function in the next piece of code takes a list of real numbers and determines whether the list is free of the value 0.5. This function is fully compiled.
f4=Compile[{{lst,_Real,1}},FreeQ[lst,0.5]];
<<CompiledFunctionTools`
CompilePrint@f4
…………
1   B0 = FreeQ[ T(R1)0, T(R0)0, R1]]
2   Return
The function below also takes a list of real numbers, and determines if the list is free of negative numbers. This function works, but timing tests will show that it's no faster than the same function defined without using Compile. The problem in this case is that the compiled function uses the pattern _?Negative. Compiled evaluation isn't used if the second argument of Compile includes patterns of any kind. When faced with a problem like this, we should use an algorithm that doesn't require patterns, or define the function without using Compile.
f5=Compile[{{lst,_Real,1}},FreeQ[lst,_?Negative]];
CompilePrint@f5
…………
1   R0 = MainEvaluate[ Function[{lst}, _?Negative][ T(R1)0]]
2   B0 = FreeQ[ T(R1)0, T(R0)0, R1]]
3   Return
The next piece of code shows one way the function from the last example can be written to avoid the use of patterns. This version will use compiled evaluation.
f5fixed=Compile[{{lst,_Real,1}}, FreeQ[Sign[lst],-1] ];
CompilePrint@f5fixed
…………
1   T(I1)1 = Sign[ T(R1)0]
2   B0 = FreeQ[ T(I1)1, T(I0)0]]
3   Return
Notice that function definitions like f[x_] = … or f[x_] := … also use patterns, so they can't be compiled!
(4) Local variables in compiled evaluation must always have the same type.
The next piece of code defines a function which uses a local variable temp, where temp starts with an integer value. Later in the function, temp is changed to a real number. In this case the function works, but timing tests will show it's no faster than the same function defined without using Compile. Local variables used in a compiled function should always keep the same type (Real, Integer, Complex, True|False) to ensure compiled evaluation is used.
f6 = Compile[{x}, Module[{temp = 5}, (temp = temp + x; Round[temp])]];
CompilePrint@f6
…………
1   I1 = I0
2   R1 = I1
3   R1 = R1 + R0
4   V17I1 = MainEvaluate[ Function[{x, tempCompile$2}, Block[{temp = tempCompile$2}, {temp = temp + x, temp}]][ R0, I1]]
5   Return
So one can fix f6 in the following ways:
f6fixed1 = Compile[{x}, Module[{temp = 5}, temp = Round[temp + x]]];
CompilePrint@f6fixed1
…………
1   I1 = I0
2   R1 = I1
3   R1 = R1 + R0
4   I2 = Round[ R1]
5   I1 = I2
6   Return
f6fixed2 = Compile[{x}, Module[{temp = 5.}, temp = temp + x; Round[temp]]];
CompilePrint@f6fixed2
…………
1   R2 = R1
2   R3 = R2 + R0
3   R2 = R3
4   I1 = Round[ R2]
5   Return
(5) A compiled function can't change the value of its argument.
The next piece of code shows an attempt to define a function. If you try to use this function you will see it doesn't work. The problem here is that the function tries to change the value of its argument, but a compiled function can't change the value of its argument for the same reason we can't evaluate 3 = 5.
f7=Compile[{x},
Do[x=Cos[x],{4}];
x];
Sometimes we want to use an algorithm that starts with some value and replaces it with another value. If we want to do this in a compiled function, we should initialize a local variable to the value of the function's argument, and then change the value of the local variable. The function in the next cell does this and uses compiled evaluation.
f7fixed=Compile[{x},
Module[{temp=x},
Do[temp=Cos[temp],{4}];
temp
]];
(6) Compiled evaluation can't work with all possible list structures.
The only list structures compiled evaluation can work with are vectors, matrices, and other tensors. The function defined in the next piece of code returns a list structure that doesn't meet these restrictions so compiled evaluation can't be used. In a case like this there is no advantage to using Compile to define the function.
f8=Compile[{x},{x,{2x,3x}}];
CompilePrint@f8
…………
11   T(R1)2 = MainEvaluate[ Hold[List][ R0, T(R1)1]]
12   Return
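A possible workaround (my own addition, not from Ersek's notebook) is to return a rectangular structure instead, for example a flat vector, which compiles fully:

f8fixed = Compile[{x}, {x, 2 x, 3 x}];  (* a rank-1 tensor of reals, no MainEvaluate needed *)
CompilePrint@f8fixed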