-n and -r arguments to IPython's %timeit magic
number and repeat are separate arguments because they serve different purposes. number controls how many executions are done for each timing and is used to get representative timings. repeat controls how many timings are taken and is used to get accurate statistics. IPython takes the mean of all repetitions to calculate the run-time of the statement and then divides it by number, so it reports an average of averages. In earlier versions it took the minimum time (min()) of all repeats, divided it by number, and reported that as the "best of".
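For example, the following call (the summed statement is just a placeholder) asks %timeit for 20 timings (-r) of 1000 executions each (-n):

>>> %timeit -n 1000 -r 20 sum(range(100))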
To understand why there are two arguments to control the number and the repeats you have to understand what you're timing and how you can measure the time.
The granularity of the clock and the number of executions
A computer has different "clocks" to measure time. These clocks have different "ticks" (depending on the OS). For example, a clock could tick in seconds, milliseconds or nanoseconds - this tick size is called the granularity of the clock.
If the duration of the execution is smaller than or roughly equal to the granularity of the clock, you cannot get representative timings. Suppose your operation takes 100 ns (= 0.0000001 seconds) but the clock only measures milliseconds (= 0.001 seconds): then most measurements would read 0 milliseconds and a few would read 1 millisecond - which one depends on where in the clock cycle the execution started and finished. That's not really representative of the duration of what you want to time.
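You can ask Python which resolution it reports for its clocks (the exact numbers depend on the OS and the Python build); a minimal sketch:

import time

# print the resolution Python reports for two common clocks
for name in ('time', 'perf_counter'):
    info = time.get_clock_info(name)
    print(f'{name}: resolution={info.resolution} s, monotonic={info.monotonic}')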
This is on Windows where time.time
has a granularity of 1 millisecond:
import time

def fast_function():
    return None

r = []
for _ in range(10000):
    start = time.time()
    fast_function()
    r.append(time.time() - start)

import matplotlib.pyplot as plt
plt.title('measuring time of no-op-function with time.time')
plt.ylabel('number of measurements')
plt.xlabel('measured time [s]')
plt.yscale('log')
plt.hist(r, bins='auto')
plt.tight_layout()
The histogram of the measured times from this example shows that almost all measurements were 0 milliseconds and only three measurements were 1 millisecond:
There are clocks with a much finer granularity on Windows; this was just to illustrate the effect of the granularity. Every clock has some granularity, even if it's finer than one millisecond.
To overcome the restriction of the granularity one can increase the number of executions so that the expected duration is significantly higher than the granularity of the clock. So instead of running the execution once, it is run number times. Taking the numbers from above and using a number of 100 000, the expected run-time would be 100 000 * 100 ns = 0.01 seconds. So, neglecting everything else, the clock would now measure roughly 10 milliseconds in almost all cases, which accurately reflects the expected total execution time.
In short, specifying a number measures the sum of number executions. You need to divide the times measured this way by number again to get the "time per execution".
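A minimal sketch of that division with the plain timeit module (the statement is just a placeholder):

import timeit

# timeit.timeit returns the *total* time of `number` executions,
# so dividing by `number` gives the time per execution
number = 100_000
total = timeit.timeit('sum(range(100))', number=number)
print(total / number, 'seconds per execution')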
Other processes and the repetitions of the execution
Your OS typically has a lot of active processes; some of them can run in parallel (on different processors or using hyper-threading), but most of them run sequentially, with the OS scheduling time slots for each process on the CPU. Most clocks don't care which process is currently running, so the measured time will differ depending on the scheduling plan. There are also clocks that measure process time instead of system time. However, they measure the complete time of the Python process, which sometimes includes a garbage collection run or other Python threads. Besides that, the Python process isn't stateless: not every operation is always exactly the same, and there are also memory allocations/re-allocations/frees happening (sometimes behind the scenes), and the time these memory operations take can vary for a lot of reasons.
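A minimal sketch contrasting a wall-clock timer with a process-time timer (time.sleep releases the CPU, so only the wall clock advances noticeably):

import time

# time.perf_counter measures wall-clock time,
# time.process_time measures CPU time of this process only
start_wall = time.perf_counter()
start_cpu = time.process_time()
time.sleep(0.1)                    # the process is idle while sleeping
print('wall clock  :', time.perf_counter() - start_wall)   # roughly 0.1 s
print('process time:', time.process_time() - start_cpu)    # close to 0 s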
Again I use a histogram measuring the time it takes to sum ten thousand ones on my computer (only using repeat and setting number to 1):
import timeit
r = timeit.repeat('sum(1 for _ in range(10000))', number=1, repeat=1_000)
import matplotlib.pyplot as plt
plt.title('measuring summation of 10_000 1s')
plt.ylabel('number of measurements')
plt.xlabel('measured time [s]')
plt.yscale('log')
plt.hist(r, bins='auto')
plt.tight_layout()
This histogram shows a sharp cutoff just below ~5 milliseconds, which indicates that this is the "optimal" time in which the operation can be executed. The higher timings are measurements where the conditions were not optimal or where other processes/threads took some of the time:
The typical approach to deal with these fluctuations is to repeat the timing many times and then use statistics to get the most accurate numbers. Which statistic is appropriate depends on what you want to measure; I'll go into this in more detail below.
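A minimal sketch of that approach with the timeit module (statement and counts are just placeholders):

import statistics
import timeit

# each value in `timings` is the total time of `number` executions
timings = timeit.repeat('sum(range(1000))', number=1_000, repeat=50)
print('min   :', min(timings))
print('mean  :', statistics.mean(timings))
print('median:', statistics.median(timings))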
Using both number and repeat
Essentially, %timeit is a wrapper around timeit.repeat, which is roughly equivalent to:
import timeit

timer = timeit.default_timer   # the timer function itself, not a call to it

results = []
for _ in range(repeat):
    start = timer()
    for _ in range(number):
        function_or_statement_to_time
    results.append(timer() - start)
But %timeit has some convenience features compared to timeit.repeat. For example, it calculates the best and average times of one execution based on the timings it obtained from repeat and number.
These are calculated roughly like this:
import statistics
best = min(results) / number
average = statistics.mean(results) / number
You could also use TimeitResult
(returned if you use the -o
option) to inspect all results:
>>> r = %timeit -o ...
7.46 ns ± 0.0788 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
>>> r.loops # the "number" is called "loops" on the result
100000000
>>> r.repeat
7
>>> r.all_runs
[0.7445439999999905,
0.7611092000000212,
0.7249667000000102,
0.7238135999999997,
0.7385598000000186,
0.7338551999999936,
0.7277425999999991]
>>> r.best
7.238135999999997e-09
>>> r.average
7.363701571428618e-09
>>> min(r.all_runs) / r.loops # calculated best by hand
7.238135999999997e-09
>>> from statistics import mean
>>> mean(r.all_runs) / r.loops # calculated average by hand
7.363701571428619e-09
General advice regarding the values of number and repeat
If you want to modify either number or repeat, then you should set number to the minimum value possible without running into the granularity of the timer. In my experience, number should be set so that number executions of the function take at least 10 microseconds (0.00001 seconds); otherwise you might only "time" the minimum resolution of the timer.
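The timeit module can pick such a number for you: Timer.autorange (Python 3.6+) increases number until one measurement takes at least about 0.2 seconds in total, which stays well clear of the timer granularity. A minimal sketch:

import timeit

# autorange() returns (number, total_time), where `number` executions
# took at least ~0.2 seconds in total
timer = timeit.Timer('sum(range(1000))')
number, total_time = timer.autorange()
print(number, 'executions took', total_time, 'seconds')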
The repeat should be set as high as possible. Having more repeats makes it more likely that you really find the best or the true average. However, more repeats also take longer, so there's a trade-off.
IPython adjusts number but keeps repeat constant. I often do the opposite: I adjust number so that number executions of the statement take ~10 µs, and then I adjust repeat so that I get a good representation of the statistics (often it's in the range 100-10000). But your mileage may vary.
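A minimal sketch of that heuristic (the 10 µs target and the repeat count are just the rough values mentioned above, not anything timeit or IPython enforces):

import timeit

timer = timeit.Timer('sum(range(100))')   # placeholder statement

# grow `number` until one timing takes at least ~10 microseconds
number = 1
while timer.timeit(number=number) < 1e-5:
    number *= 10

# then take many repeats and look at the statistics
results = timer.repeat(repeat=1_000, number=number)
print('best per execution:', min(results) / number)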
Which statistic is best?
The documentation of timeit.repeat
mentions this:
Note
It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.
For example, if one wants to find out how fast the algorithm can be, one could use the minimum of these repetitions. If one is more interested in the average or median of the timings, one can use those measurements. In most cases the number one is most interested in is the minimum, because the minimum represents how fast the execution can be - it is probably the one execution where the process was least interrupted (by other processes or by GC) and had the most optimal memory operations.
To illustrate the differences, I ran the above timing again, but this time I included the minimum, mean, and median:
import timeit
r = timeit.repeat('sum(1 for _ in range(10000))', number=1, repeat=1_000)
import numpy as np
import matplotlib.pyplot as plt
plt.title('measuring summation of 10_000 1s')
plt.ylabel('number of measurements')
plt.xlabel('measured time [s]')
plt.yscale('log')
plt.hist(r, bins='auto', color='black', label='measurements')
plt.tight_layout()
plt.axvline(np.min(r), c='lime', label='min')
plt.axvline(np.mean(r), c='red', label='mean')
plt.axvline(np.median(r), c='blue', label='median')
plt.legend()
Contrary to this "advice" (see the quoted documentation above), IPython's %timeit reports the average instead of the min(). However, it also only uses a repeat of 7 by default - which I think is too few to accurately determine the minimum - so using the average in this case is actually sensible. It's a great tool for "quick-and-dirty" timings.
If you need something that can be customized based on your needs, you could use timeit.repeat directly or even a 3rd-party module. For example:
pyperf
perfplot
simple_benchmark
(my own library)
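Of these, pyperf can also be driven from a small script; a minimal sketch based on its Runner API (assuming pyperf is installed; the name and statement are placeholders - check the pyperf documentation for details):

import pyperf

# pyperf spawns worker processes and parses command-line arguments,
# so this is meant to be run as a standalone script
runner = pyperf.Runner()
runner.timeit(name='sum 1000 ints',
              stmt='sum(range(1000))')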
It looks like the latest version of %timeit is taking the average of the r n-loop averages, not the best of the averages. Evidently, this has changed from earlier versions of IPython. The best time of the r averages can still be obtained via the TimeitResult return value (the -o option), but it is no longer the value that is displayed.
Comment: I recently ran the code from above and found that the following syntax no longer works:
n = 1
r = 50
tr = %timeit -n $n -r $r -q -o pass; compute_mean(x,np)
That is, it is no longer possible (it seems) to use $var
to pass a variable to the timeit
magic command. Does this mean that this magic command should be retired and replaced with the timeit
module?
I am using Python 3.7.4.
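If you do fall back to the timeit module, its globals argument (Python 3.5+) lets the timed statement see your variables directly, without any magic-command substitution; a minimal sketch (the statement is just a placeholder, not the compute_mean call from the comment):

import timeit

n = 1
r = 50
# `globals=globals()` makes `n` visible inside the timed statement
timings = timeit.repeat('sum(range(n * 1000))', repeat=r, number=1, globals=globals())
print(min(timings))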