Partition data for efficient joining for Spark dataframe/dataset
You can do this with the repartition method of the DataFrame/Dataset API.
It lets you specify one or more columns to partition the data by, e.g.
val df2 = df.repartition($"colA", $"colB")
You can also specify the desired number of partitions in the same call:
val df2 = df.repartition(10, $"colA", $"colB")
Note: this does not guarantee that the partitions for the dataframes will be located on the same node, only that the partitioning is done in the same way.
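If you want to check what repartition actually did, a minimal sketch like the following can be pasted into spark-shell. The sample rows and the "value" column are made up for illustration; only colA and colB come from the snippet above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.spark_partition_id

// Local session for illustration only; in spark-shell getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("repartition-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data using the column names from the example above.
val df = Seq((1, "a", 10.0), (2, "b", 20.0), (1, "a", 30.0)).toDF("colA", "colB", "value")

val df2 = df.repartition(10, $"colA", $"colB")

// With an explicit count, this should match the number you asked for.
println(df2.rdd.getNumPartitions)

// Rows with the same (colA, colB) values end up with the same partition id.
df2.withColumn("pid", spark_partition_id()).show()
```

Two DataFrames repartitioned with the same column expressions and the same partition count will hash equal keys to the same partition id, which is the "partitioned in the same way" part of the note above.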
You can repartition a DataFrame after loading it if you know you'll be joining it multiple times:
val users = spark.read.load("/path/to/users").repartition($"userId")
val joined1 = users.join(addresses, "userId")
joined1.show() // <-- 1st shuffle for repartition
val joined2 = users.join(salary, "userId")
joined2.show() // <-- skips shuffle for users since it's already been repartitioned
So it'll shuffle the data once and then reuse the shuffle files on subsequent joins.
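To see where shuffles happen in a join like this, you can print the physical plan. This is only a sketch: the two small DataFrames are invented stand-ins for the users and addresses data, and broadcast joins are switched off because Spark would otherwise broadcast tables this small instead of shuffling them.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("join-plan-sketch").getOrCreate()
import spark.implicits._

// Disable broadcast joins so the tiny demo tables still go through a shuffle-based join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Hypothetical stand-ins for the users and addresses data.
val users = Seq((1, "ann"), (2, "bob")).toDF("userId", "name").repartition($"userId")
val addresses = Seq((1, "Oslo"), (2, "Rome")).toDF("userId", "city")

val joined1 = users.join(addresses, "userId")

// Prints the physical plan; shuffles show up as "Exchange hashpartitioning(userId...)" nodes.
joined1.explain()
```

In the printed plan, the users branch should contain only the Exchange that comes from repartition, with no second one added for the join, while the addresses branch gets an Exchange inserted by the join itself.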
However, if you know you'll be repeatedly shuffling data on certain keys, your best bet is to save the data as bucketed tables. This writes the data out already hash-partitioned, so when you read the tables back in and join them you avoid the shuffle. You can do so as follows:
// you need to pick a number of buckets that makes sense for your data
// bucketBy is a DataFrameWriter method, so go through .write
users.write.bucketBy(50, "userId").saveAsTable("users")
addresses.write.bucketBy(50, "userId").saveAsTable("addresses")
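// read the bucketed tables back under new names so they don't clash with the DataFrames above
val bucketedUsers = spark.read.table("users")
val bucketedAddresses = spark.read.table("addresses")
val joined = bucketedUsers.join(bucketedAddresses, "userId")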
joined.show() // <-- no shuffle since tables are co-partitioned
In order to avoid a shuffle, the tables have to use the same bucketing (i.e. the same number of buckets on the same columns), and the join has to be on the bucket columns.
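One way to check that a bucketed join really skips the shuffle is to look for shuffle exchange nodes in the physical plan. The sketch below assumes Spark 3.x; the table names with the _demo suffix, the sample rows and the bucket count of 4 are all made up for illustration. Adaptive query execution is turned off only so that the plan can be inspected before the query runs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

// Local session for illustration only; in spark-shell getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("bucketing-sketch").getOrCreate()
import spark.implicits._

// Keep tiny tables from being broadcast, and make the plan inspectable up front.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.conf.set("spark.sql.adaptive.enabled", "false")

// Hypothetical demo tables, both bucketed the same way on the join key.
Seq((1, "ann"), (2, "bob")).toDF("userId", "name")
  .write.mode("overwrite").bucketBy(4, "userId").saveAsTable("users_bucketed_demo")
Seq((1, "Oslo"), (2, "Rome")).toDF("userId", "city")
  .write.mode("overwrite").bucketBy(4, "userId").saveAsTable("addresses_bucketed_demo")

val joined = spark.read.table("users_bucketed_demo")
  .join(spark.read.table("addresses_bucketed_demo"), "userId")

// Collect every shuffle exchange node in the physical plan of the join.
val shuffles = joined.queryExecution.executedPlan.collect { case e: ShuffleExchangeExec => e }
println(s"shuffle exchanges in plan: ${shuffles.size}")
```

If the bucketing is picked up, the collected list should come back empty. Changing the bucket count on one of the two tables is a quick way to see an exchange reappear.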