Weird bug in Pandas and Numpy regarding multithreading

Looking at numpy, it looks like, under the hood it has had on/off issues with multithreading, and depending on what version you are using you may expect to may start to see crashes when you bump up ne.set_vml_num_threads() ..

http://numpy-discussion.10968.n7.nabble.com/ANN-NumExpr-2-7-0-Release-td47414.html

I need to get my head around how this is glued in to the python interpreter, given your code example where it seems to be somehow allowing multiple apparently synchronous/ordered calls to np.sqrt() to proceed in parallel. I guess if python interpreter is always just returning a reference to an object when it pops the stack, and in your example is just pitching those references and not assigning or manipulating them in any way it would be fine. But if subsequent loop iterations depend on previous ones then it seems less clear how these could be safely parallelized. Arguably silent failure / wrong results is an outcome worse than crashes.


Pandas uses numexpr under the hood to calculate some operations, and numexpr sets the maximal number of threads for vml to 1, when it is imported:

# The default for VML is 1 thread (see #39)
set_vml_num_threads(1)

and it gets imported by pandas when df+df is evaluated in expressions.py:

from pandas.core.computation.check import _NUMEXPR_INSTALLED

if _NUMEXPR_INSTALLED:
   import numexpr as ne

However, Anaconda distribution also uses vml-functionality for such functions as sqrt, sin, cos and so on - and once numexpr set the maximal number of vml-threads to 1, the numpy-functions no longer use parallelization.

The problem can be easily seen in gdb (using your slow script):

>>> gdb --args python slow.py
(gdb) b mkl_serv_domain_set_num_threads
function "mkl_serv_domain_set_num_threads" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (mkl_serv_domain_set_num_threads) pending.
(gbd) run
Thread 1 "python" hit Breakpoint 1, 0x00007fffee65cd70 in mkl_serv_domain_set_num_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) bt 
#0  0x00007fffee65cd70 in mkl_serv_domain_set_num_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#1  0x00007fffe978026c in _set_vml_num_threads(_object*, _object*) () from /home/ed/anaconda37/lib/python3.7/site-packages/numexpr/interpreter.cpython-37m-x86_64-linux-gnu.so
#2  0x00005555556cd660 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:694
...
(gdb) print $rdi
$1 = 1

i.e. we can see, numexpr sets number of threads to 1. Which is later used when vml-sqrt function is called:

(gbd) b mkl_serv_domain_get_max_threads
Breakpoint 2 at 0x7fffee65a900
(gdb) (gdb) c
Continuing.

Thread 1 "python" hit Breakpoint 2, 0x00007fffee65a900 in mkl_serv_domain_get_max_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) bt
#0  0x00007fffee65a900 in mkl_serv_domain_get_max_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#1  0x00007ffff01fcea9 in mkl_vml_serv_threader_d_1i_1o () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
#2  0x00007fffedf78563 in vdSqrt () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_lp64.so
#3  0x00007ffff5ac04ac in trivial_two_operand_loop () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-x86_64-linux-gnu.so

So we can see numpy uses vml's implementation of vdSqrt which utilizes mkl_vml_serv_threader_d_1i_1o to decide whether calculation should be done in parallel and it looks the number of threads:

(gdb) fin
Run till exit from #0  0x00007fffee65a900 in mkl_serv_domain_get_max_threads () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
0x00007ffff01fcea9 in mkl_vml_serv_threader_d_1i_1o () from /home/ed/anaconda37/lib/python3.7/site-packages/numpy/../../../libmkl_intel_thread.so
(gdb) print $rax
$2 = 1

the register %rax has the maximal number of threads and it is 1.

Now we can use numexpr to increase the number of vml-threads, i.e.:

import numpy as np
import numexpr as ne
import pandas as pd
df=pd.DataFrame(np.random.random((10,10)))
df+df

#HERE: reset number of vml-threads
ne.set_vml_num_threads(8)

x=np.random.random(1000000)
for i in range(10000):
    np.sqrt(x)     # now in parallel

Now multiple cores are utilized!