Why is Spark saveAsTable with bucketBy creating thousands of files?

The same question came to my mind when I saw far too many files, so I searched and found this:

"Unlike bucketing in Apache Hive, Spark SQL creates the bucket files per the number of buckets and partitions. In other words, the number of bucketing files is the number of buckets multiplied by the number of task writers (one per partition)."

Source: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-bucketing.html

I think this explains why you end up with that many files (e.g., 200 writer tasks × 50 buckets would already give 10,000 files).

As for your question no. 2: if you manage the number of partitions with repartition (resources permitting), you can limit the number of files that get created.
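
For example, something along these lines keeps the file count close to the bucket count by repartitioning on the bucket column before the write (a minimal sketch; the DataFrame, column, path and table names below are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("bucketing-sketch")
  .getOrCreate()
import spark.implicits._

val numBuckets = 16
val df = spark.range(0, 100000).withColumnRenamed("id", "userId")

// Repartitioning on the bucket column with numBuckets partitions means each
// writer task holds (roughly) one bucket's worth of rows, so the output is
// about numBuckets files instead of numBuckets * number-of-tasks.
df.repartition(numBuckets, $"userId")
  .write
  .bucketBy(numBuckets, "userId")
  .sortBy("userId")
  .option("path", "/tmp/bucketed_table")
  .saveAsTable("bucketed_table")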


Please use Spark SQL, which goes through Hive support (the old HiveContext), to write the data into the Hive table; that way the insert uses the number of buckets you configured in the table schema.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("hive.execution.engine", "tez")
  .config("hive.exec.max.dynamic.partitions", "400")
  .config("hive.exec.max.dynamic.partitions.pernode", "400")
  .config("hive.enforce.bucketing", "true")
  .config("hive.optimize.sort.dynamic.partition", "true")
  .config("hive.vectorized.execution.enabled", "true")
  .config("hive.enforce.sorting", "true")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("insert into hiveTableName partition (partition_column) select * from myParquetFile")
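
For this to pick up the bucket count, the target table has to be declared with a CLUSTERED BY clause (and myParquetFile registered as a temporary view over your source data). A hypothetical DDL, with made-up column names and bucket count, would look like:

// Hypothetical DDL for the target table; column names, bucket count and
// storage format are placeholders, not from the original question.
spark.sql("""
  CREATE TABLE IF NOT EXISTS hiveTableName (
    id BIGINT,
    value STRING
  )
  PARTITIONED BY (partition_column STRING)
  CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS
  STORED AS ORC
""")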

Spark's bucketing implementation does not honor the specified number of buckets on its own: each writing partition produces its own set of files, so you end up with a lot of files for each bucket.

Please refer to this link: https://www.slideshare.net/databricks/hive-bucketing-in-apache-spark-with-tejas-patil

Hope this helps.

Ravi


I was able to find a workaround (on Spark 2.1). It solves the number of files problem but might have some performance implications.

import org.apache.spark.sql.functions.{hash, lit, pmod}
import spark.implicits._ // for the $"..." column syntax (already in scope in spark-shell)

dataframe
  // precompute the bucket id the same way Spark assigns buckets on write
  .withColumn("bucket", pmod(hash($"bucketColumn"), lit(numBuckets)))
  // one shuffle partition per bucket, so each task writes a single file
  .repartition(numBuckets, $"bucket")
  .write
  .format(fmt)
  .bucketBy(numBuckets, "bucketColumn")
  .sortBy("bucketColumn")
  .option("path", "/path/to/your/table")
  .saveAsTable("table_name")

I think Spark's bucketing algorithm does a positive mod of the MurmurHash3 of the bucket column value. The snippet above simply replicates that logic and repartitions the data so that each write partition contains all the data for a bucket.
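
If you want to sanity-check that assumption, here's a quick, throwaway check (the numbers are arbitrary; it assumes you're in spark-shell, where `spark` and the `$` syntax are already available):

import org.apache.spark.sql.functions.{hash, lit, pmod}

val numBuckets = 5

// hash() is Spark's Murmur3-based hash of the column value; pmod keeps the
// result in [0, numBuckets), which should match the bucket id used on write.
spark.range(0, 10)
  .withColumn("bucket", pmod(hash($"id"), lit(numBuckets)))
  .show()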

You can do the same with partitioning + bucketing.

dataframe
  .withColumn("bucket", pmod(hash($"bucketColumn"), lit(numBuckets)))
  // repartition on both columns so all rows for a given (partition, bucket)
  // pair end up in the same task and therefore in a single file
  .repartition(numBuckets, $"partitionColumn", $"bucket")
  .write
  .format(fmt)
  .partitionBy("partitionColumn")
  .bucketBy(numBuckets, "bucketColumn")
  .sortBy("bucketColumn")
  .option("path", "/path/to/your/table")
  .saveAsTable("table_name")

Tested with 3 partitions and 5 buckets locally using csv format (both partition and bucket columns are just numbers):

$ tree .
.
├── _SUCCESS
├── partitionColumn=0
│   ├── bucket=0
│   │   └── part-00004-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00000.csv
│   ├── bucket=1
│   │   └── part-00003-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00001.csv
│   ├── bucket=2
│   │   └── part-00002-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00002.csv
│   ├── bucket=3
│   │   └── part-00004-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00003.csv
│   └── bucket=4
│       └── part-00001-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00004.csv
├── partitionColumn=1
│   ├── bucket=0
│   │   └── part-00002-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00000.csv
│   ├── bucket=1
│   │   └── part-00004-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00001.csv
│   ├── bucket=2
│   │   └── part-00002-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00002.csv
│   ├── bucket=3
│   │   └── part-00001-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00003.csv
│   └── bucket=4
│       └── part-00003-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00004.csv
└── partitionColumn=2
    ├── bucket=0
    │   └── part-00000-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00000.csv
    ├── bucket=1
    │   └── part-00001-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00001.csv
    ├── bucket=2
    │   └── part-00001-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00002.csv
    ├── bucket=3
    │   └── part-00003-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00003.csv
    └── bucket=4
        └── part-00000-c2f2b7b5-40a1-4d24-8c05-084b7a05e399_00004.csv

Here's bucket=0 for all 3 partitions (you can see that they all contain the same values):

$ paste partitionColumn=0/bucket=0/part-00004-5f860e5c-f2c2-4d52-8035-aa00e4432770_00000.csv partitionColumn=1/bucket=0/part-00002-5f860e5c-f2c2-4d52-8035-aa00e4432770_00000.csv partitionColumn=2/bucket=0/part-00000-5f860e5c-f2c2-4d52-8035-aa00e4432770_00000.csv | head
0   0   0
4   4   4
6   6   6
16  16  16
18  18  18
20  20  20
26  26  26
27  27  27
29  29  29
32  32  32

I actually like having the extra bucket column. But if you don't, you can drop it right before the write and you'll get numBuckets files per partition.
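
For reference, a minimal sketch of that variant (same placeholder names as above); the bucket column only drives the repartition and is dropped before writing, which doesn't change the data layout:

dataframe
  .withColumn("bucket", pmod(hash($"bucketColumn"), lit(numBuckets)))
  .repartition(numBuckets, $"partitionColumn", $"bucket")
  .drop("bucket") // keep the layout from the repartition, not the helper column
  .write
  .format(fmt)
  .partitionBy("partitionColumn")
  .bucketBy(numBuckets, "bucketColumn")
  .sortBy("bucketColumn")
  .option("path", "/path/to/your/table")
  .saveAsTable("table_name")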