Parallel Loops in C++
std::async
may be a good fit here, if you are happy to let the C++
runtime control the parallelism.
Example from the cppreference.com:
#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>
#include <future>
template <typename RAIter>
int parallel_sum(RAIter beg, RAIter end)
{
auto len = end - beg;
if(len < 1000)
return std::accumulate(beg, end, 0);
RAIter mid = beg + len/2;
auto handle = std::async(std::launch::async,
parallel_sum<RAIter>, mid, end);
int sum = parallel_sum(beg, mid);
return sum + handle.get();
}
int main()
{
std::vector<int> v(10000, 1);
std::cout << "The sum is " << parallel_sum(v.begin(), v.end()) << '\n';
}
What is your platform? You can look at OpenMP, though it's not a part of C++. But it is widely supported by compilers.
As for range-based for loops, see, e.g., Using OpenMP with C++11 range-based for loops?.
I've also seen few documents at http://www.open-std.org that indicate some efforts to incorporate parallel constructs/algorithms into future C++, but don't know what's their current status.
UPDATE
Just adding some exemplary code:
template <typename RAIter>
void loop_in_parallel(RAIter first, RAIter last) {
const size_t n = std::distance(first, last);
#pragma omp parallel for
for (size_t i = 0; i < n; i++) {
auto& elem = *(first + i);
// do whatever you want with elem
}
}
The number of threads can be set at runtime via the OMP_NUM_THREADS
environment variable.
With C++11 you can parallelize a for loop with only a few lines of codes.
My function parallel_for()
(define later in the post) splits a for loop into smaller chunks (sub loops), and each chunk assigned to a thread. Here is the usage:
/// Say you want to parallelize this:
for(int i = 0; i < nb_elements; ++i)
computation(i);
/// Then you would do:
parallel_for(nb_elements, [&](int start, int end){
for(int i = start; i < end; ++i)
computation(i);
});
My parallel_for()
also works within a class:
struct My_obj {
/// Replacing:
void sequential_for(){
for(int i = 0; i < nb_elements; ++i)
computation(i);
}
/// By:
void process_chunk(int start, int end)
{
for(int i = start; i < end; ++i)
computation(i);
}
void threaded_for(){
parallel_for(nb_elements, [this](int s, int e){
this->process_chunk(s, e);
} );
}
};
Finally here is the implementation of parallel_for()
, just paste in a header file and use it at will:
#include <algorithm>
#include <thread>
#include <functional>
#include <vector>
/// @param[in] nb_elements : size of your for loop
/// @param[in] functor(start, end) :
/// your function processing a sub chunk of the for loop.
/// "start" is the first index to process (included) until the index "end"
/// (excluded)
/// @code
/// for(int i = start; i < end; ++i)
/// computation(i);
/// @endcode
/// @param use_threads : enable / disable threads.
///
///
static
void parallel_for(unsigned nb_elements,
std::function<void (int start, int end)> functor,
bool use_threads = true)
{
// -------
unsigned nb_threads_hint = std::thread::hardware_concurrency();
unsigned nb_threads = nb_threads_hint == 0 ? 8 : (nb_threads_hint);
unsigned batch_size = nb_elements / nb_threads;
unsigned batch_remainder = nb_elements % nb_threads;
std::vector< std::thread > my_threads(nb_threads);
if( use_threads )
{
// Multithread execution
for(unsigned i = 0; i < nb_threads; ++i)
{
int start = i * batch_size;
my_threads[i] = std::thread(functor, start, start+batch_size);
}
}
else
{
// Single thread execution (for easy debugging)
for(unsigned i = 0; i < nb_threads; ++i){
int start = i * batch_size;
functor( start, start+batch_size );
}
}
// Deform the elements left
int start = nb_threads * batch_size;
functor( start, start+batch_remainder);
// Wait for the other thread to finish their task
if( use_threads )
std::for_each(my_threads.begin(), my_threads.end(), std::mem_fn(&std::thread::join));
}
Lastly you can define macros to get even more compact expression:
#define PARALLEL_FOR_BEGIN(nb_elements) parallel_for(nb_elements, [&](int start, int end){ for(int i = start; i < end; ++i)
#define PARALLEL_FOR_END()})
Now converting a sequential for:
for(int i = 0; i < nb_elements; ++i)
computation(i);
Is only a matter of doing:
PARALLEL_FOR_BEGIN(nb_edges)
{
computation(i);
}PARALLEL_FOR_END();
With the parallel algorithms in C++17 we can now use:
std::vector<std::string> foo;
std::for_each(
std::execution::par,
foo.begin(),
foo.end(),
[](auto&& item)
{
//do stuff with item
});
to compute loops in parallel. The first parameter specifies the execution policy