How to find max value in pair RDD?
Use Array.maxBy
val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
val maxKey = a.maxBy(_._2)
// maxKey: (String, Int) = (d,3)
or RDD.max
val rdd = sc.parallelize(a)
val maxKey2 = rdd.max()(new Ordering[Tuple2[String, Int]]() {
  override def compare(x: (String, Int), y: (String, Int)): Int =
    Ordering[Int].compare(x._2, y._2)
})
For PySpark:
Let a be the pair RDD with String keys and integer values. Then
a.max(lambda x: x[1])
returns the key-value pair with the maximum value. The max function orders the elements by the return value of the lambda.
Here a is a pair RDD with elements such as ('key', int), and x[1] refers to the integer part of each element.
Note that max called without a key function compares the tuples themselves, so it orders by key first and returns the element with the largest key, not the one with the largest value.
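The difference between the default ordering and a key function can be checked without Spark, since Python's built-in max follows the same key convention. This is only a minimal sketch: the list pairs is an assumed local stand-in for the pair RDD a, and the extra element ('e', 0) is made up to make the two results differ.

```python
# Local stand-in for the pair RDD `a` (assumption: no Spark session is used here).
pairs = [("a", 1), ("b", 2), ("c", 1), ("d", 3), ("e", 0)]

# Default ordering compares tuples element by element, so the key decides first.
max_by_key = max(pairs)

# With a key function, only the integer part of each pair is compared.
max_by_value = max(pairs, key=lambda x: x[1])

print(max_by_key)    # ('e', 0)
print(max_by_value)  # ('d', 3)
```

On a real RDD the corresponding calls would be a.max() and a.max(lambda x: x[1]).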
Documentation is available at
Spark RDDs are more time-efficient when they are left as RDDs and not converted to Arrays, so reduce the RDD directly:
stringIntTupleRDD.reduce((x, y) => if (x._2 > y._2) x else y)
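Here is a rough picture of how a pairwise reduce like this arrives at the maximum, written in plain Python. The two inner lists are a hypothetical partitioning that stands in for RDD partitions, and pick_larger is an illustrative name, not a Spark API. Each partition is reduced on its own and the partial results are then merged with the same function.

```python
from functools import reduce

def pick_larger(x, y):
    # Keep whichever pair has the larger integer part (ties go to y).
    return x if x[1] > y[1] else y

# Hypothetical partitioning of the data, standing in for RDD partitions.
partitions = [[("a", 1), ("b", 2)], [("c", 1), ("d", 3)]]

# Reduce each partition separately, then merge the partial results.
partials = [reduce(pick_larger, part) for part in partitions]
max_pair = reduce(pick_larger, partials)

print(partials)  # [('b', 2), ('d', 3)]
print(max_pair)  # ('d', 3)
```

If several pairs share the maximum value, which one comes back depends on the order in which the pairs are combined, so don't rely on a particular key in that case.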
Use takeOrdered(1)(Ordering[Int].reverse.on(_._2))
val a = Array(("a",1), ("b",2), ("c",1), ("d",3))
val rdd = sc.parallelize(a)
val maxKey = rdd.takeOrdered(1)(Ordering[Int].reverse.on(_._2))
// maxKey: Array[(String, Int)] = Array((d,3))
Quoting the note from RDD.takeOrdered:
This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
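For a local feel of what taking one element under a reversed ordering does, heapq.nsmallest with a negated key is a rough Python analogue. This is a sketch on a plain list, assumed here in place of the RDD. As with takeOrdered, the result is a collection, so the pair itself has to be pulled out by index.

```python
import heapq

# Local stand-in for the RDD contents (assumption: a plain list, no Spark).
pairs = [("a", 1), ("b", 2), ("c", 1), ("d", 3)]

# Negating the value reverses the ordering, so the "smallest" is the maximum.
top_one = heapq.nsmallest(1, pairs, key=lambda x: -x[1])

# The result is a one-element list, not a bare pair.
max_pair = top_one[0]

print(top_one)   # [('d', 3)]
print(max_pair)  # ('d', 3)
```

With a single element requested, the result stays small, so the driver-memory warning quoted above is not a concern for this particular use.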