How to parallelize this Python for loop when using Numba

Numba has been updated and prange() works now! (I'm answering my own question.)

The improvements to Numba's parallel computing capabilities are discussed in this blog post, dated December 12, 2017. Here is a relevant snippet from the blog:

Long ago (more than 20 releases!), Numba used to have support for an idiom to write parallel for loops called prange(). After a major refactoring of the code base in 2014, this feature had to be removed, but it has been one of the most frequently requested Numba features since that time. After the Intel developers parallelized array expressions, they realized that bringing back prange would be fairly easy

Using Numba version 0.36.1, I can parallelize my embarrassingly parallel for-loop using the following simple code:

@numba.jit(nopython=True, parallel=True)
def csrMult_parallel(x,Adata,Aindices,Aindptr,Ashape): 

    numRowsA = Ashape[0]    
    Ax = np.zeros(numRowsA)

    for i in numba.prange(numRowsA):
        Ax_i = 0.0        
        for dataIdx in range(Aindptr[i],Aindptr[i+1]):

            j = Aindices[dataIdx]
            Ax_i += Adata[dataIdx]*x[j]

        Ax[i] = Ax_i            

    return Ax

In my experiments, parallelizing the for-loop made the function execute about eight times faster than the version I posted at the beginning of my question, which was already using Numba, but which was not parallelized. Moreover, in my experiments the parallelized version is about 5x faster than the command Ax = A.dot(x) which uses scipy's sparse matrix-vector multiplication function. Numba has crushed scipy and I finally have a python sparse matrix-vector multiplication routine that is as fast as MATLAB.


Thanks for your quant updates, Daniel.
The following lines might be hard to swallow, but kindly believe me, there are more things to take into account. I have worked on hpc / parallel-processing / parallelism-amdahl problems
having matrices in the scales ~ N [TB]; N > 10 and their sparse accompanions, so some pieces of experience may be useful for your further views.

WARNING: Do not expect any dinner to be served for free

A wish to parallelise a piece of code sounds like a more and more often contemporary re-articulated mana. The problem is not the code, but the cost of such move.

The economy is the number one problem. Amdahl's Law, as it was originally formulated by Gene Amdahl, did not take into account the very costs of [PAR]-processes-setups + [PAR]-processes-finalisations & terminations, which indeed have to be paid in every real-world implementation.

The overhead-strict Amdahl's Law depicts the scale of these un-avoidable adverse effects and helps understand a few new aspects that have to be evaluated before one opts to introduce parallelisation ( at an acceptable cost of doing so, as it is very, indeed VERY EASY to pay MUCH more than one may gain from -- where a naive disappointment from a degraded processing performance is the easier part of the story ).

Feel free to read more posts on overhead-strict Amdahl's Law re-formulation, if willing to better understand this topic and to pre-calculate the actual "minimum"-subProblem-"size", for which the sum-of-[PAR]-overheads will get at least justified from real-world tools for introducing the parallel-split of the subProblem onto N_trully_[PAR]_processes ( not any "just"-[CONCURRENT], but true-[PARALLEL] -- these are way not equal ).


Python may get a dose of steroids for increased performance:

Python is a great prototyping eco-system, whereas numba, numpy and other compiled-extensions help a lot to boost performance way farther than a native, GIL-stepped python (co-)-processing typically delivers.

Here, you try to enforce numba.jit() to arrange the job almost-for-free, just by its automated jit()-time lexical-analyser ( that you throw your code on ), which ought both "understand" your global goal ( What to do ), and also propose some vectorisation-tricks ( How best assemble a heap of CPU-instructions for maximum efficiency of such code-execution ).

This sounds easy, but it is not.

Travis Oliphant's team has made immense progress on numba tools, but let's be realistic and fair not to expect any form of automated-wizardry to get implemented inside a .jit()-lexer + code-analysis, when trying to transform a code and assemble a more efficient flow of machine instructions to implement the high-level task's goal.

@guvectorize? Here? Seriously?

Due to [PSPACE] sizing, you may immediately forget to ask numba to somehow efficiently "stuff" the GPU-engine with data, a memory-footprint of which is way behind the GPU-GDDR sizings ( not speaking at all about too-"shallow" GPU-kernel sizes for such mathematically-"tiny" processing to just multiply, potentially in [PAR], but to later sum in [SEQ] ).

(Re-)-Loading GPU with data takes heaps of time. If having paid that, the In-GPU memory-latencies are not very friendly for "tiny"-GPU-kernels economy either -- your GPU-SMX code-execution will have to pay ~ 350-700 [ns] just to fetch a number ( most probably not automatically re-aligned for the best coalesced SM-cache-friendly re-use in next steps and you may notice, that you never, let me repeat it, NEVER re-use a single matrix cell at all, so caching per-se will not deliver anything under those 350~700 [ns] per matrix cell ), while a smart pure numpy-vectorised code can process matrix-vector product in less than 1 [ns] per cell on even the largest [PSPACE]-footprints.

That is a yardstick to compare to.

( Profiling would better show here the hard-facts, but the principle is well known beforehand, without testing how to move a few TB of data onto GPU-fabric just to realise this on one's own. )


The worst of the bad news:

Given the memory scales of the matrix A, the worse effect to be expected is, that the sparse-organisation of the storage of the matrix representation will most probably devastate most, if not all, possible performance gains achievable by numba-vectorised tricks on dense matrix representations, as there will likely be almost zero chance for efficient memory fetched cache-line re-uses and sparsity will also break any easy way to achieve a compact mapping of vectorised operations and these will hardly remain able to get easily translated into advanced CPU-hardware vector-processing resources.


Inventory of solvable problems:

  • always better pre-allocate the vector Ax = np.zeros_like( A[:,0] ) and pass it as another parameter into the numba.jit()-compiled parts of the code, so as to avoid repetitive paying additional [PTIME,PSPACE]-costs for creating ( again ) new memory-allocations ( the more if the vector is suspect for being used inside an externally orchestrated iterative optimisation process )
  • always better specify ( to narrow the universality, for the sake of resulting code performance )
    at least the numba.jit( "f8[:]( f4[:], f4[:,:], ... )" )-calling interface directives
  • always review all numba.jit()-options available and their respective default values ( may change version to version ) for your specific situation ( disabling GIL and better aligning the goals with numba + hardware capabilities will always help in numerically intensive parts of the code )

@jit(   signature = [    numba.float32( numba.float32, numba.int32 ),                                   #          # [_v41] @decorator with a list of calling-signatures
                         numba.float64( numba.float64, numba.int64 )                                    #
                         ],    #__________________ a list of signatures for prepared alternative code-paths, to avoid a deferred lazy-compilation if undefined
        nopython = False,      #__________________ forces the function to be compiled in nopython mode. If not possible, compilation will raise an error.
        nogil    = False,      #__________________ tries to release the global interpreter lock inside the compiled function. The GIL will only be released if Numba can compile the function in nopython mode, otherwise a compilation warning will be printed.
        cache    = False,      #__________________ enables a file-based cache to shorten compilation times when the function was already compiled in a previous invocation. The cache is maintained in the __pycache__ subdirectory of the directory containing the source file.
        forceobj = False,      #__________________ forces the function to be compiled in object mode. Since object mode is slower than nopython mode, this is mostly useful for testing purposes.
        locals   = {}          #__________________ a mapping of local variable names to Numba Types.
        ) #____________________# [_v41] ZERO <____ TEST *ALL* CALLED sub-func()-s to @.jit() too >>>>>>>>>>>>>>>>>>>>> [DONE]
 def r...(...):
      ...