Spark DataFrame: How to efficiently split a DataFrame into groups with the same column values
As noted in my comments, one potentially easy approach to this problem would be to use:
```python
df.write.partitionBy("hour").saveAsTable("myparquet")
```
As noted, the folder structure would be `myparquet/hour=1`, `myparquet/hour=2`, ..., `myparquet/hour=24`, as opposed to `myparquet/1`, `myparquet/2`, ..., `myparquet/24`.
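For concreteness, here is a minimal PySpark sketch of that write; the `hour` column and `myparquet` table name follow the example above, and the sample rows are made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Toy data with an "hour" column to partition on (made-up rows).
df = spark.createDataFrame(
    [("a", 1), ("b", 1), ("c", 2)],
    ["event", "hour"],
)

# Produces one subfolder per distinct hour value:
#   myparquet/hour=1/..., myparquet/hour=2/..., etc.
df.write.partitionBy("hour").saveAsTable("myparquet")
```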
To change the folder structure, you could:

- Potentially use the Hive configuration setting `hcat.dynamic.partitioning.custom.pattern` within an explicit `HiveContext`; more information at HCatalog DynamicPartitions.
- Change the file system directly after you have executed the `df.write.partitionBy(...).saveAsTable(...)` command, for example (run inside the `myparquet` output directory) with something like

  ```bash
  for f in hour=*; do mv -- "$f" "${f#hour=}"; done
  ```

  which strips the `hour=` prefix from each partition folder name. (A sketch of doing the same rename from Python follows this list.)
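If you would rather do the rename from the same job instead of the shell, here is a minimal Python sketch, assuming the table's data lives under a local `myparquet/` directory (for HDFS or object storage you would use the corresponding filesystem API instead of `os`):

```python
import os

base = "myparquet"  # assumed local path to the partitioned output

for name in os.listdir(base):
    # Rename "hour=7" -> "7", leaving any other entries untouched.
    if name.startswith("hour="):
        os.rename(os.path.join(base, name),
                  os.path.join(base, name[len("hour="):]))
```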
It is important to note that by changing the naming pattern for the folders, when you run `spark.read.parquet(...)` on that folder, Spark will no longer discover the dynamic partitions automatically, since the partition-key information (i.e. the `hour=` prefix) is missing from the path.
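If you do rename the folders, one way to get the partition column back is to read each subfolder separately and re-attach the value yourself. A minimal sketch, assuming the renamed layout `myparquet/1`, ..., `myparquet/24` from above (relative to wherever the table data was actually written):

```python
from functools import reduce
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read each renamed folder and restore the lost "hour" partition column.
parts = [
    spark.read.parquet(f"myparquet/{h}").withColumn("hour", F.lit(h))
    for h in range(1, 25)
]
df_all = reduce(lambda a, b: a.unionByName(b), parts)
```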