pyspark -- best way to sum values in column of type Array(Integer())
You can use the higher-order SQL function AGGREGATE (the SQL equivalent of reduce/fold from functional programming), like this:
import pyspark.sql.functions as F
df = df.select(
    'name',
    F.expr('AGGREGATE(scores, 0, (acc, x) -> acc + x)').alias('Total')
)
The first argument is the array column, the second is the initial value (it should have the same type as the values you are summing, so you may need "0.0" or "DOUBLE(0)" if your inputs are not integers), and the third is a lambda function that adds each element of the array to the accumulator variable (which starts out at the initial value, 0 here).
The transformation runs in a single projection operator, so it is very efficient. You also do not need to know the size of the arrays in advance, and the arrays can have a different length on each row.
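To illustrate the point about the initial value's type, here is a minimal, self-contained sketch on a made-up DataFrame (the scores and weights columns are assumptions for illustration, not from the question):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# made-up sample data: integer scores and double weights
df = spark.createDataFrame(
    [('Alice', [10, 20, 30], [0.5, 1.5]), ('Bob', [5, 5], [2.0])],
    'name string, scores array<int>, weights array<double>'
)

df.select(
    'name',
    F.expr('AGGREGATE(scores, 0, (acc, x) -> acc + x)').alias('Total'),
    # the accumulator must match the element type, hence DOUBLE(0) for array<double>
    F.expr('AGGREGATE(weights, DOUBLE(0), (acc, x) -> acc + x)').alias('TotalWeight')
).show()
# +-----+-----+-----------+
# | name|Total|TotalWeight|
# +-----+-----+-----------+
# |Alice|   60|        2.0|
# |  Bob|   10|        2.0|
# +-----+-----+-----------+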
In case you don't know the length of the array (as in your example):
import pyspark.sql.functions as F
pseudo_counts = df.select('name').distinct().withColumn('score', F.lit(0))
df = df.select('name', F.explode('scores').alias('score')).unionByName(pseudo_counts)
df = df.groupby('name').agg(F.sum('score').alias('Total'))
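Presumably the union with the zero-score rows is there because explode drops rows whose arrays are empty, so without it those names would disappear from the result. A minimal sketch on made-up data (the names and values are assumptions for illustration):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# made-up sample data, including a name with an empty array
df = spark.createDataFrame(
    [('Alice', [10, 20, 30]), ('Bob', [5, 5]), ('Carol', [])],
    'name string, scores array<int>'
)

# zero-score rows keep names whose arrays are empty (explode would drop them)
pseudo_counts = df.select('name').distinct().withColumn('score', F.lit(0))
exploded = df.select('name', F.explode('scores').alias('score')).unionByName(pseudo_counts)
totals = exploded.groupby('name').agg(F.sum('score').alias('Total'))
totals.show()
# Alice -> 60, Bob -> 10, Carol -> 0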
In case you do know the length of the array:
import pyspark.sql.functions as F
length_of_array = 3
df = df.select('name', sum([F.col('scores').getItem(i) for i in range(length_of_array)]).alias('Total'))
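Here the built-in Python sum simply chains the per-index columns with +, so for length 3 it is equivalent to writing the addition out by hand (a sketch, assuming the same df and column names as above):

import pyspark.sql.functions as F

# equivalent hand-written form for length_of_array = 3
df = df.select(
    'name',
    (F.col('scores').getItem(0)
     + F.col('scores').getItem(1)
     + F.col('scores').getItem(2)).alias('Total')
)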
Thanks to cricket_007 for the tip and to this old mail for the fixed-length idea.