How can I convince Spark not to make an exchange when the join key is a superset of the bucketBy key?
Based on some research and exploration, this seems to be the least hacky solution.
Building on this example:
table1.join(table2, ["key1"])
.filter(table1.col("key2") == table2.col("key2"))
Instead of using Spark's EqualTo (==), implementing a custom MyEqualTo expression (delegating to the Spark EqualTo implementation is fine) seems to solve the issue. Because the optimizer no longer recognizes the predicate as a join-key equality, Spark won't "optimize" the join, and it will just pull the filter up into the SortMergeJoin.
Similarly, the join condition can also be written as:
(table1.col("key1") == table2.col("key1")) AND
table1.col("key2").myEqualTo(table2.col("key2"))
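To confirm the exchange is actually gone, inspect the physical plan. A sketch, assuming table1 and table2 were written bucketed by key1 and a hypothetical myEqualTo helper wraps the custom expression in a Column:

```scala
val joined = table1.join(table2,
  (table1.col("key1") === table2.col("key1")) &&
  myEqualTo(table1.col("key2"), table2.col("key2")))

// The plan should show a SortMergeJoin reading the bucketed scans
// directly, with no Exchange hashpartitioning step above them.
joined.explain()
```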