spark.sql.crossJoin.enabled for Spark 2.x
Spark >= 3.0
spark.sql.crossJoin.enabled
is true by default (SPARK-28621).
Spark >= 2.1
You can use crossJoin:
df1.crossJoin(df2)
It makes your intention explicit and keeps the more conservative configuration in place to protect you from unintended cross joins.
Spark 2.0
SQL properties can be set dynamically at runtime with the RuntimeConfig.set
method, so you should be able to call
spark.conf.set("spark.sql.crossJoin.enabled", true)
whenever you want to explicitly allow a Cartesian product.
In PySpark, where the boolean literal is True, I think it should be
spark.conf.set("spark.sql.crossJoin.enabled", True)
Otherwise it'll give
NameError: name 'true' is not defined
The TPC-DS benchmark query set includes queries that contain cross joins. Unless you explicitly write CROSS JOIN in the query, or dynamically set Spark's property to true with spark.conf.set("spark.sql.crossJoin.enabled", true), you will run into an exception.
The error appears on TPC-DS queries 28, 61, 88, and 90 because the original query syntax from the Transaction Processing Performance Council (TPC) separates relations with commas, and Spark's default join operation is an inner join. My team has also decided to use CROSS JOIN in lieu of changing Spark's default properties.
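To illustrate the rewrite described above, here is a minimal sketch that compares a comma-separated FROM clause with its explicit CROSS JOIN form. It runs against SQLite (Python's built-in sqlite3) purely as a stand-in so that it works without a Spark installation; the sales table, its columns, and the b1/b2 subqueries are hypothetical and only loosely mimic the shape of the affected benchmark queries.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Spark SQL (assumption:
# this only shows that the two spellings return the same rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (price REAL, qty INTEGER);  -- hypothetical table
    INSERT INTO sales VALUES (10.0, 1), (20.0, 5), (30.0, 9);
""")

# Comma-separated subqueries with no join condition: an implicit Cartesian product.
implicit_sql = """
    SELECT b1.avg_price AS b1_avg, b2.avg_price AS b2_avg
    FROM (SELECT AVG(price) AS avg_price FROM sales WHERE qty BETWEEN 0 AND 5) b1,
         (SELECT AVG(price) AS avg_price FROM sales WHERE qty BETWEEN 6 AND 10) b2
"""

# The same query with the Cartesian product spelled out.
explicit_sql = """
    SELECT b1.avg_price AS b1_avg, b2.avg_price AS b2_avg
    FROM (SELECT AVG(price) AS avg_price FROM sales WHERE qty BETWEEN 0 AND 5) b1
    CROSS JOIN
         (SELECT AVG(price) AS avg_price FROM sales WHERE qty BETWEEN 6 AND 10) b2
"""

implicit_rows = conn.execute(implicit_sql).fetchall()
explicit_rows = conn.execute(explicit_sql).fetchall()
print(implicit_rows, explicit_rows)
```

Both statements return the same single row here. SQLite accepts either spelling, whereas Spark 2.x with the default configuration would be expected to reject the first one, which is why writing CROSS JOIN explicitly avoids touching the configuration.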
To change default values of configuration settings in Dataproc, you don't even need an init action; you can use the --properties flag when creating your cluster from the command line:
gcloud dataproc clusters create --properties spark:spark.sql.crossJoin.enabled=true my-cluster ...