spark pivot without aggregation
There isn't a good way to pivot without aggregating in Spark, basically it assumes that you would just use a OneHotEncoder for that functionality, but that lacks the human readability of a straight pivot. The best ways that I have found to do it are:
val pivot = countryKPI
.groupBy("country_id3", "value")
.pivot("indicator_id", Seq("a", "b"))
.agg(first(col("indicator_id")))
pivot.show
+-----------+-----+----+----+
|country_id3|value| a| b|
+-----------+-----+----+----+
| ABC| 8|null| b|
| POL| 9| a|null|
| POL| 7|null| b|
| ABC| 7| a|null|
+-----------+-----+----+----+
However, if (country_id3, value)
is not distinct within the dataset, then you collapse rows and potentially be taking a somewhat meaningless first()
value from your pivot col.
An alternative is to add an id column to the dataset, group on that new id, pivot your desired column, then join back to the original dataset. Here's an example:
val countryWithId = countryKPI.withColumn("id", monotonically_increasing_id)
val pivotted = countryWithId
.groupBy("id")
.pivot("indicator_id")
.agg(first(col("indicator_id")))
val pivot2 = countryWithId.join(pivotted, Seq("id")).drop("id") //.drop("indicator_id")
pivot2.show
+-----------+------------+-----+----+----+
|country_id3|indicator_id|value| a| b|
+-----------+------------+-----+----+----+
| ABC| a| 7| a|null|
| ABC| b| 8|null| b|
| POL| a| 9| a|null|
| POL| b| 7|null| b|
+-----------+------------+-----+----+----+
In this case, you still have the original pivot column, but you can .drop()
that as well if you prefer.
The following snippet seems to work - but I am not sure if an aggregation by avg is correct -even though "fitting numbers" are the output.
countryKPI.groupBy("country_id3").pivot("indicator_id").avg("value").show
I'm not sure if this is "inefficient" for a bigger amount of data (avg) compared to just reusing the values (as I do not want to aggregate).