Are there MapReduce implementations on GPUs (CUDA)?
Years ago I implemented cumar.
Because I was using Mac OS X and the 'nvcc' compiler was not happy with Apple's 'clang', I designed the library in pure C++ (with a flavor of lambda: kernels are written as lambda-like strings).
A typical map operation looks like this:
//A = B + C, all of length 'n'
cumar::map()( "[](double& a, double b, double c){ a = b+c; }" )(A, A+n, B, C);
A reduce operation looks like this:
// x = min(A), A of size 'n'
cumar::reduce()( "[](double a, double b){ return a < b ? a : b; }" )(A, A+n);
At present, the easiest interface is provided by thrust::reduce.
As you noted, there is also Mars.