Joining a large and a ginormous Spark DataFrame

As I see it, you have a problem of partitions that are too large (probably due to the size of your data). You can try a few approaches:

  1. Try setting spark.sql.shuffle.partitions to something like 2048 or even more (the default is 200); there will be a shuffle while joining your DataFrames. Play with this parameter so that the total volume of the bigger dataset divided by it comes to roughly 64 MB-100 MB per partition (depending on the file format). In general, you should see in the Spark UI that each task (one per partition) processes a "normal" amount of data (64 MB-100 MB at most). See the first sketch after this list.

  2. If the first approach does not work, you can do this join in the RDD API: convert both DataFrames into RDDs, then partition both RDDs with a HashPartitioner(numPartitions), where the number of partitions is computed as described above. See the second sketch after this list.

  3. Lately a new option was added by the Spark devs: you can bucket the ginormous table into N buckets (i.e. store it ready for the join). There are a few limitations, but it can eliminate shuffling the ginormous data completely. bucketBy is supported only with the saveAsTable API, not with save. After you have bucketed the data, on the next iteration you can load it as an external table while providing the bucketing spec (see https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-table.html):

    CREATE TABLE ginormous
      -- ...here you must specify the schema
      USING PARQUET
      CLUSTERED BY (a, b, c) INTO N BUCKETS
      LOCATION 'hdfs://your-path'

Then, when you have loaded the ginormous table as a bucketed one, you can load the big table and repartition it to the same number of buckets and by the same columns (df.repartition(N, a, b, c)). See the third sketch below.
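
A minimal sketch of option 1, assuming the bigger side is about 1 TB on disk, a target of roughly 100 MB of shuffled data per task, and a join on the same columns (a, b, c) used in option 3; the paths, size estimate, and variable names are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("big-join").getOrCreate()

    // Placeholder inputs; swap in your real sources.
    val bigDf       = spark.read.parquet("hdfs://your-path/big")
    val ginormousDf = spark.read.parquet("hdfs://your-path/ginormous")

    // Aim for ~100 MB of shuffled data per task; the total size is an assumption.
    val biggerSizeBytes = 1024L * 1024 * 1024 * 1024    // assume ~1 TB for the bigger side
    val targetPerTask   = 100L * 1024 * 1024            // ~100 MB per task
    val numPartitions   = (biggerSizeBytes / targetPerTask).toInt   // ~10k partitions here

    spark.conf.set("spark.sql.shuffle.partitions", numPartitions.toString)

    // The join now shuffles into numPartitions partitions of roughly the target size.
    val joined = bigDf.join(ginormousDf, Seq("a", "b", "c"))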
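
A sketch of the RDD route from option 2, under the same assumptions (join on a, b, c; numPartitions computed as in the previous sketch):

    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(numPartitions)

    // Key both sides by the join columns so matching keys land in the same partition.
    val bigRdd = bigDf.rdd
      .map(r => ((r.getAs[Any]("a"), r.getAs[Any]("b"), r.getAs[Any]("c")), r))
      .partitionBy(partitioner)

    val ginormousRdd = ginormousDf.rdd
      .map(r => ((r.getAs[Any]("a"), r.getAs[Any]("b"), r.getAs[Any]("c")), r))
      .partitionBy(partitioner)

    // Both RDDs now share the same partitioner, so the join itself
    // does not reshuffle either side again.
    val joinedRdd = bigRdd.join(ginormousRdd)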
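
And a sketch of the bucketing route from option 3; the table name and the bucket count N = 2048 are placeholders:

    import org.apache.spark.sql.functions.col

    val N = 2048   // illustrative bucket count

    // One-time cost: write the ginormous table bucketed by the join columns.
    // bucketBy works only together with saveAsTable, not with save.
    ginormousDf.write
      .format("parquet")
      .bucketBy(N, "a", "b", "c")
      .sortBy("a", "b", "c")
      .mode("overwrite")
      .saveAsTable("ginormous_bucketed")

    // On later runs: read the bucketed table back and repartition the big side
    // to the same number of partitions on the same columns before joining.
    val ginormousBucketed = spark.table("ginormous_bucketed")
    val bigRepartitioned  = bigDf.repartition(N, col("a"), col("b"), col("c"))

    val bucketedJoin = bigRepartitioned.join(ginormousBucketed, Seq("a", "b", "c"))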


Can you try setting spark.shuffle.memoryFraction=0.0?

This will cause the shuffle to spill everything to disk, and you should not get an OOM error from the shuffle.
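
A minimal sketch of how that setting could be passed in, assuming it is applied at application start (it cannot be changed on a running session). Note that on Spark 1.6+ spark.shuffle.memoryFraction is a legacy setting, honored only together with spark.memory.useLegacyMode=true:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .setAppName("big-join")
      .set("spark.memory.useLegacyMode", "true")    // needed on Spark 1.6+ for the setting below
      .set("spark.shuffle.memoryFraction", "0.0")   // spill shuffle data straight to disk

    val spark = SparkSession.builder().config(conf).getOrCreate()

The same can be passed on the command line with spark-submit --conf spark.shuffle.memoryFraction=0.0.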