How to partition RDD by key in Spark?
How about just doing a groupByKey using kind? Or another PairRDDFunctions method.
It sounds to me like you don't really care about the partitioning itself, just that you get all records of a specific kind in one processing flow?
The pair functions allow this:
rdd.keyBy(_.kind).partitionBy(new HashPartitioner(PARTITIONS))
.foreachPartition(...)
However, you can probably be a little safer with something more like:
rdd.keyBy(_.kind).reduceByKey(...)
or mapValues, or a number of the other pair functions that guarantee you get the pieces as a whole.
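To make that concrete, here is a minimal self-contained sketch of both approaches. It assumes a hypothetical DeviceData case class with a kind field (the class name comes from the question; the value field, the partition count of 8, and the sum inside reduceByKey are purely illustrative):
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Hypothetical record type; only `kind` is taken from the question.
case class DeviceData(kind: String, value: Double)

def process(rdd: RDD[DeviceData]): Unit = {
  // Route every record of a given kind to the same partition.
  val byKind = rdd.keyBy(_.kind).partitionBy(new HashPartitioner(8))
  byKind.foreachPartition { iter =>
    // A partition may hold several kinds, but each kind arrives whole.
    iter.foreach { case (kind, data) => println(s"$kind -> $data") }
  }

  // Safer alternative: let Spark shuffle and combine per key.
  val totals = rdd.keyBy(_.kind).mapValues(_.value).reduceByKey(_ + _)
  totals.foreach(println)
}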
Would it be correct to partition an RDD[DeviceData] by overriding the deviceData.hashCode() method and using only the hash code of kind?
It wouldn't be. If you take a look at the Java Object.hashCode documentation, you'll find the following information about the general contract of hashCode:
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
So unless a notion of equality based purely on the kind of a device fits your use case, and I seriously doubt it does, tinkering with hashCode to get the desired partitioning is a bad idea. In the general case you should implement your own partitioner, but here it is not required.
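For completeness, if you did need one, a custom partitioner is only a few lines. A minimal sketch keyed on kind (the class name is illustrative; the non-negative modulo mirrors what HashPartitioner itself does, so negative hash codes still land in a valid partition):
import org.apache.spark.Partitioner

class KindPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case null => 0
    case k =>
      val mod = k.hashCode % numPartitions
      if (mod < 0) mod + numPartitions else mod
  }
}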
Since, excluding specialized scenarios in SQL and GraphX, partitionBy is valid only on a PairRDD, it makes sense to create an RDD[(String, DeviceData)] and use a plain HashPartitioner:
deviceDataRdd.map(dev => (dev.kind, dev)).partitionBy(new HashPartitioner(n))
Just keep in mind that in a situation where kind has low cardinality or a highly skewed distribution, using it for partitioning may not be an optimal solution.
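A quick way to sanity-check that before committing to kind as the key, reusing deviceDataRdd from above (countByValue brings the per-kind counts back to the driver, so only do this when the number of distinct kinds is small):
val kindCounts = deviceDataRdd.map(_.kind).countByValue()
// A handful of distinct kinds, or wildly uneven counts, means the
// resulting partitions will be unbalanced.
kindCounts.foreach { case (kind, count) => println(s"$kind: $count") }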