Combine text from multiple rows in PySpark
One option is to use pyspark.sql.functions.collect_list()
as the aggregate function.
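First, a reproducible setup. The sample data here is an assumption chosen to match the output shown below; any DataFrame with a grouping column and a string column will work the same way:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data: two names per category
spark_df = spark.createDataFrame(
    [("A", "A1"), ("A", "A2"), ("B", "B1"), ("B", "B2")],
    ["category", "name"]
)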
from pyspark.sql.functions import collect_list
grouped_df = spark_df.groupby('category').agg(collect_list('name').alias("name"))
This collects the values of name into a list per category, and the resulting output looks like:
grouped_df.show()
#+---------+---------+
#|category |name |
#+---------+---------+
#|A |[A1, A2] |
#|B |[B1, B2] |
#+---------+---------+
Update 2019-06-10:
If you want your output as a concatenated string, you can use pyspark.sql.functions.concat_ws to join the values of the collected list. This is better than using a udf, because it runs natively on the JVM and avoids serializing rows to a Python worker:
from pyspark.sql.functions import concat_ws
grouped_df.withColumn("name", concat_ws(", ", "name")).show()
#+---------+-------+
#|category |name |
#+---------+-------+
#|A |A1, A2 |
#|B |B1, B2 |
#+---------+-------+
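As a side note, you can also collect and concatenate in a single aggregation, skipping the intermediate array column. A minimal sketch, assuming the same schema as above (concat_ws accepts an array column directly):

from pyspark.sql.functions import collect_list, concat_ws

spark_df.groupby('category').agg(
    concat_ws(", ", collect_list('name')).alias("name")
).show()

This produces the same concatenated output as the two-step version.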
Original Answer: If you wanted your output as a concatenated string, you'd have to use a udf. For example, you could first do the groupBy() as above and then apply a udf to join the collected list:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Join each collected list into a single comma-separated string
concat_list = udf(lambda lst: ", ".join(lst), StringType())
grouped_df.withColumn("name", concat_list("name")).show()
#+---------+-------+
#|category |name |
#+---------+-------+
#|A |A1, A2 |
#|B |B1, B2 |
#+---------+-------+