How to find max value in pair RDD?

Use Array.maxBy method:

val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
val maxKey = a.maxBy(_._2)
// maxKey: (String, Int) = (d,3)

or RDD.max:

val maxKey2 = rdd.max()(new Ordering[Tuple2[String, Int]]() {
  override def compare(x: (String, Int), y: (String, Int)): Int = 
      Ordering[Int].compare(x._2, y._2)
})

For Pyspark:

Let a be the pair RDD with keys as String and values as integers then

a.max(lambda x:x[1])

returns the key value pair with the maximum value. Basically the max function orders by the return value of the lambda function.

Here a is a pair RDD with elements such as ('key',int) and x[1] just refers to the integer part of the element.

Note that the max function by itself will order by key and return the max value.

Documentation is available at https://spark.apache.org/docs/1.5.0/api/python/pyspark.html#pyspark.RDD.max


Spark RDD's are more efficient timewise when they are left as RDD's and not turned into Arrays

strinIntTuppleRDD.reduce((x, y) => if(x._2 > y._2) x else y)

Use takeOrdered(1)(Ordering[Int].reverse.on(_._2)):

val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
val rdd = sc.parallelize(a)
val maxKey = rdd.takeOrdered(1)(Ordering[Int].reverse.on(_._2))
// maxKey: Array[(String, Int)] = Array((d,3))

Quoting the note from RDD.takeOrdered:

This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.