How to get all columns after groupBy on Dataset&lt;Row&gt; in Spark SQL 2.1.0
You need a slightly different approach here. You were almost there; let me help you understand.
Dataset&lt;Row&gt; resultset = studentDataset.groupBy("name").max("age");
Now you can join resultset back with studentDataset:
Dataset<Row> joinedDS = studentDataset.join(resultset, "name");
The problem with groupBy is that it returns a RelationalGroupedDataset, so the result depends on which aggregation you apply next (sum, min, mean, max, etc.): the aggregated value is combined only with the grouping column. In your case, the name column is paired with the max of age, so the result has only those two columns. Likewise, if you apply groupBy on age and then max on the age column, you get two columns: age and max(age).
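A minimal, untested Scala sketch of that behaviour (the sample rows are made up for illustration, and an active SparkSession named spark is assumed):
import spark.implicits._

val studentDataset = Seq(
  ("bob", 20, "blah"),
  ("bob", 40, "blah")
).toDF("name", "age", "another_column")

// groupBy followed by max keeps only the grouping column and the aggregate
val resultset = studentDataset.groupBy("name").max("age")
resultset.show()   // columns: name, max(age)

// joining back on "name" restores the remaining columns
val joinedDS = studentDataset.join(resultset, "name")
joinedDS.show()    // columns: name, age, another_column, max(age)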
Note: the code is not tested, so please adjust it as needed. Hope this clears up your query.
The accepted answer isn't ideal because it requires a join. Joining big DataFrames can cause a big shuffle that'll execute slowly.
Let's create a sample data set and test the code:
import spark.implicits._                    // needed for toDF
import org.apache.spark.sql.functions.max  // needed for max() below

val df = Seq(
("bob", 20, "blah"),
("bob", 40, "blah"),
("karen", 21, "hi"),
("monica", 43, "candy"),
("monica", 99, "water")
).toDF("name", "age", "another_column")
This code should run faster with large DataFrames.
// take the max of each column so it survives the groupBy;
// the duplicated name column is dropped again afterwards
df
  .groupBy("name")
  .agg(
    max("name").as("name1_dup"),
    max("another_column").as("another_column"),
    max("age").as("age")
  )
  .drop("name1_dup")
  .show()
+------+--------------+---+
| name|another_column|age|
+------+--------------+---+
|monica| water| 99|
| karen| hi| 21|
| bob| blah| 40|
+------+--------------+---+
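To see where the difference comes from, you can compare the physical plans of the two approaches; this is a quick sketch, assuming the df and the max import defined above are still in scope:
// join-based approach: the plan contains a join plus extra shuffle exchanges
df.join(df.groupBy("name").max("age"), "name").explain()

// aggregation-only approach: the plan contains only aggregation, no join
df.groupBy("name")
  .agg(max("another_column").as("another_column"), max("age").as("age"))
  .explain()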