Prevent DataFrame.partitionBy() from removing partitioned columns from schema
I'd like to add a bit more context here and provide PySpark code instead of Scala for those who need it. You need to be careful how you read the partitioned dataframe back in if you want to keep the partitioning variables (the details matter). Let's start by writing a partitioned dataframe like this:
df.write.mode("overwrite").partitionBy("Season").parquet("partitioned_parquet/")
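(Here, df is just assumed to be a DataFrame with a Value column and a Season column, matching the output below. Something along these lines would do as a hypothetical toy example:)
# Hypothetical toy data; your real df presumably has more rows and more seasons.
df = spark.createDataFrame(
    [(71, 2010), (77, 2010), (83, 2010), (54, 2010), (100, 2010),
     (60, 2011), (65, 2012)],
    ["Value", "Season"],
)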
To read the whole dataframe back in WITH the partitioning variables included...
path = "partitioned_parquet/"
parquet = spark.read.parquet(path)
parquet.show(5)
Result:
+-----+------+
|Value|Season|
+-----+------+
|   71|  2010|
|   77|  2010|
|   83|  2010|
|   54|  2010|
|  100|  2010|
+-----+------+
only showing top 5 rows
Note that if you include an * at the end of your path, the partitioning variables will be dropped.
path = "partitioned_parquet/*"
parquet = spark.read.parquet(path)
parquet.show(5)
Result:
+-----+
|Value|
+-----+
|   71|
|   77|
|   83|
|   54|
|  100|
+-----+
only showing top 5 rows
Now, if you want to read in only portions of the partitioned dataframe, you need to set the basePath option so that your partitioning variables are kept (in this case "Season").
path = "partitioned_parquet/"
dataframe = spark.read.option("basePath", path).parquet(
    path + 'Season=2010/',
    path + 'Season=2011/',
    path + 'Season=2012/',
)
dataframe.show(5)
Result:
+-----+------+
|Value|Season|
+-----+------+
|   71|  2010|
|   77|  2010|
|   83|  2010|
|   54|  2010|
|  100|  2010|
+-----+------+
only showing top 5 rows
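(If you want to confirm that the partitioning variable really made it back into the schema, a quick sanity check like this works; printSchema is just standard PySpark, nothing specific to this example:)
# Season should now show up alongside Value in the schema
dataframe.printSchema()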
Hope that helps folks!
I can think of one workaround, which is rather lame, but works.
import spark.implicits._
// Duplicate the partition columns so the original type/category columns
// survive inside the written data files
val duplicated = df.withColumn("_type", $"type").withColumn("_category", $"category")
duplicated.write.partitionBy("_type", "_category").parquet(config.outpath)
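For anyone who wants this workaround in PySpark, a rough equivalent might look like this (the type/category column names and config.outpath are simply carried over from the Scala snippet above, so adjust them to your own setup):
from pyspark.sql import functions as F

# Copy the partition columns so the originals stay inside the data files
duplicated = (df
    .withColumn("_type", F.col("type"))
    .withColumn("_category", F.col("category")))
duplicated.write.partitionBy("_type", "_category").parquet(config.outpath)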
I'm posting this in the hope that someone has a better answer or explanation than I do (or that the OP has found a better solution), since I have the same question.