Are there MapReduce implementations on GPUs (CUDA)?

Years ago I implemented cumar.

Since I was using Mac OS X and the 'nvcc' compiler did not work well with Apple's 'clang', I designed this library in pure C++ (with a flavor of lambdas).

A typical map operation looks like this:

//A = B + C, all of length 'n'
cumar::map()( "[](double& a, double b, double c){ a = b+c; }" )(A, A+n, B, C);

A reduce operation looks like this:

// x = min(A), A of size 'n'
cumar::reduce()( "[](double a, double b){ return a < b ? a : b; }" )(A, A+n);
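
For comparison, here is a minimal CPU-only sketch in standard C++ of what the two calls above compute. It is not cumar's API and does not run on the GPU; `map_add` and `reduce_min` are hypothetical names used only for illustration, and the sketch assumes equal-length inputs and a non-empty array for the reduction.

```cpp
#include <algorithm>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper: CPU equivalent of the map example, a[i] = b[i] + c[i].
// Assumes b and c have the same length.
std::vector<double> map_add(const std::vector<double>& b,
                            const std::vector<double>& c) {
    std::vector<double> a(b.size());
    std::transform(b.begin(), b.end(), c.begin(), a.begin(),
                   std::plus<double>());
    return a;
}

// Hypothetical helper: CPU equivalent of the reduce example, min over a.
// Folds the same binary lambda over the range; assumes a is non-empty.
double reduce_min(const std::vector<double>& a) {
    return std::accumulate(a.begin() + 1, a.end(), a.front(),
                           [](double x, double y) { return x < y ? x : y; });
}
```

The map applies one function independently to each element, and the reduce folds an associative binary function over the range; that independence and associativity are what allow both to be parallelized on a GPU.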

At present, the easiest interface is provided by thrust::reduce.

As you noted, there is also Mars.