How do I add parallel computation to this example?
From the looks of it you are trying to approximate an integral of a function f
on the region from b
to a
using trapezoidal rule. You are right in trying to parallelize the code, but there are a couple of issues with the attempt:
- First of all, you need a work stealing scheduler in order to get any benefit, since
par
is unlikely to give you a speedup - Secondly, the way it is implemented each intermediate point
f(x)
is computed twice, except for the border pointsf(a)
andf(b)
Few moths ago I needed this functionality, so I added it the the massiv
library: trapezoidRule
, which conveninetly solves both of the above problems and avoids usage of lists.
Here is an out of the box solution, but it will not automatically parallelize the computation, since there is only one element of the array is being computed (it was designed to estimate integrals over many regions)
integrate' :: (Double -> Double) -> Double -> Double -> Double
integrate' f a b = trapezoidRule Seq P (\scale x -> f (scale x)) a d (Sz1 1) n ! 0
where
n = 1000
d = b - a
As a sanity check:
λ> integrate (\x -> x * x) 10 20 -- implementation from the question
2333.3335
λ> integrate' (\x -> x * x) 10 20
2333.3335
Here is a solution that will do the automatic parallelization and will avoid redundant evaluation:
integrateA :: Int -> (Double -> Double) -> Double -> Double -> Double
integrateA n f a b =
let step = (b - a) / fromIntegral n
sz = size segments - 1
segments = computeAs P $ A.map f (enumFromStepN Par a step (Sz (n + 1)))
area y0 y1 = step * (y0 + y1) / 2
areas = A.zipWith area (extract' 0 sz segments) (extract' 1 sz segments)
in A.sum areas
Because of list fusion, in case of your solution using lists, there will be no allocation, as such, for simple cases it will be very fast. In the above solution there is gonna be an array of size n+1
allocated in order to promote sharing and avoid double function evaluation. There will also be extra cost encountered due to scheduling, since forking off threads does not come for free. But in the end for really expensive functions and very large n
it is possible to get factor of ~x3 speed up on a quad core processor.
Below are some benchmarks of integrating gaussian function with n = 100000
:
benchmarking Gaussian1D/list
time 3.657 ms (3.623 ms .. 3.687 ms)
0.999 R² (0.998 R² .. 1.000 R²)
mean 3.627 ms (3.604 ms .. 3.658 ms)
std dev 80.50 μs (63.62 μs .. 115.4 μs)
benchmarking Gaussian1D/array Seq
time 3.408 ms (3.304 ms .. 3.523 ms)
0.987 R² (0.979 R² .. 0.994 R²)
mean 3.670 ms (3.578 ms .. 3.839 ms)
std dev 408.0 μs (293.8 μs .. 627.6 μs)
variance introduced by outliers: 69% (severely inflated)
benchmarking Gaussian1D/array Par
time 1.340 ms (1.286 ms .. 1.393 ms)
0.980 R² (0.967 R² .. 0.989 R²)
mean 1.393 ms (1.328 ms .. 1.485 ms)
std dev 263.3 μs (160.1 μs .. 385.6 μs)
variance introduced by outliers: 90% (severely inflated)
Side note suggestion. Switching to Simpson's rule will give you a better approximation. Implementation is available in massiv
;)
Edit
This is such a fun problem, that I decided to see what would it take to implement it without any array allocations. Here is what I came up with:
integrateS :: Int -> (Double -> Double) -> Double -> Double -> Double
integrateS n f a b =
let step = (b - a) / fromIntegral n
segments = A.map f (enumFromStepN Seq (a + step) step (Sz n))
area y0 y1 = step * (y0 + y1) / 2
sumWith (acc, y0) y1 =
let acc' = acc + area y0 y1
in acc' `seq` (acc', y1)
in fst $ A.foldlS sumWith (0, f a) segments
Above approach runs in constant memory, since the few arrays that do get created aren't real arrays backed by memory, but instead are delayed arrays. With a bit of trickery around fold accumulator we can share the results, thus avoiding double function evaluation. This results in astonishing speed up:
benchmarking Gaussian1D/array Seq no-alloc
time 1.788 ms (1.777 ms .. 1.799 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 1.787 ms (1.781 ms .. 1.795 ms)
std dev 23.85 μs (17.19 μs .. 31.96 μs)
The downside to the above approach is that it is not easily parallelizable, but not impossible. Embrace yourself, here is a monstrosity that can run on 8 capabilities (hardcoded and in my case 4 cores with hyperthreading):
-- | Will not produce correct results if `n` is not divisible by 8
integrateN8 :: Int -> (Double -> Double) -> Double -> Double -> Double
integrateN8 n f a b =
let k = 8
n' = n `div` k
step = (b - a) / fromIntegral n
segments =
makeArrayR D (ParN (fromIntegral k)) (Sz1 k) $ \i ->
let start = a + step * fromIntegral n' * fromIntegral i + step
in (f start, A.map f (enumFromStepN Seq (start + step) step (Sz (n' - 1))))
area y0 y1 = step * (y0 + y1) / 2
sumWith (acc, y0) y1 =
let acc' = acc + area y0 y1
in acc' `seq` (acc', y1)
partialResults =
computeAs U $ A.map (\(y0, arr) -> (y0, A.foldlS sumWith (0, y0) arr)) segments
combine (acc, y0) (y1, (acci, yn)) =
let acc' = acc + acci + area y0 y1
in acc' `seq` (acc', yn)
in fst $ foldlS combine (0, f a) partialResults
The only real array allocated is for keeping partialResults
which has a total of 16 Double
elements. Speed improvement is not as drastic, but nevertheless it is there:
benchmarking Gaussian1D/array Par no-alloc
time 960.1 μs (914.3 μs .. 1.020 ms)
0.968 R² (0.944 R² .. 0.990 R²)
mean 931.8 μs (900.8 μs .. 976.3 μs)
std dev 129.2 μs (84.20 μs .. 198.8 μs)
variance introduced by outliers: 84% (severely inflated)
my default go-to for any map
composition would be by using parmap
from Strategies
API http://hackage.haskell.org/package/parallel-3.2.2.0/docs/Control-Parallel-Strategies.html#g:7 , I'll add an example once I'm around my PC.
Edit:
You'd use parMap in the following way,
module Main where
import Control.Parallel.Strategies
main = putStrLn $ show $ integrate f 1.1 1.2
f :: Double -> Double
f x = x
integrate :: (Double -> Double) -> Double -> Double -> Double
integrate f a b =
let
step = (b - a) / 1000
segments = [a + x * step | x <- [0..999]]
area x = step * (f x + f (x + step)) / 2
in sum $ parMap rpar area segments
Then compile with:
ghc -O2 -threaded -rtsopts Main.hs
and run using the RTS + N flag to control the parallelism ./Main +RTS -N -RTS
-N can be specified e.g. -N6 to run on 6 threads or can be left empty to use all possible threads.