How can I sum multiple columns in a spark dataframe in pyspark?
Try this:
df = df.withColumn('result', sum(df[col] for col in df.columns))
df.columns is the list of column names in df, and sum here is Python's built-in, which folds the Columns together with +.
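For instance, a minimal runnable sketch (the DataFrame and its column names are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data; replace with your own DataFrame
df = spark.createDataFrame([(1, 2, 3), (4, 5, 6)], ['a', 'b', 'c'])

# Built-in sum starts at 0 and folds each Column in with +,
# producing the column expression ((0 + a) + b) + c
df = df.withColumn('result', sum(df[c] for c in df.columns))
df.show()  # result: 6 and 15
Note that this relies on Python's built-in sum; a star import from pyspark.sql.functions would shadow it (more on that below).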
TL;DR
You can do this:
from functools import reduce
from operator import add
from pyspark.sql.functions import col
df.na.fill(0).withColumn("result", reduce(add, [col(x) for x in df.columns]))
Explanation:
The df.na.fill(0)
portion handles nulls in your data; without it, any null in a row would make that row's sum null. If you don't have any nulls, you can skip that and do this instead:
df.withColumn("result", reduce(add, [col(x) for x in df.columns]))
If you have a static list of columns, you can do this:
df.withColumn("result", col("col1") + col("col2") + col("col3"))
But if you don't want to type out the whole column list, you need to build the expression col("col1") + col("col2") + col("col3")
iteratively. For this, you can use reduce
with the add
operator to get this:
reduce(add, [col(x) for x in df.columns])
The columns are added two at a time, so you actually get (col("col1") + col("col2")) + col("col3")
rather than col("col1") + col("col2") + col("col3")
, but the effect is the same.
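You can see the left-to-right folding by printing the resulting Column (the exact repr format varies across PySpark versions):
from functools import reduce
from operator import add
from pyspark.sql.functions import col

# Fold three columns and inspect the generated expression
summed = reduce(add, [col(x) for x in ['col1', 'col2', 'col3']])
print(summed)  # prints something like Column<'((col1 + col2) + col3)'>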
The col(x)
ensures that you are adding Column objects rather than plain strings; reducing over the raw column names with add
would just concatenate them into the string "col1col2col3".
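Putting it all together, a minimal sketch with a null in the data (the DataFrame is made up for illustration):
from functools import reduce
from operator import add
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical data containing a null
df = spark.createDataFrame([(1, 2, None), (4, 5, 6)], ['a', 'b', 'c'])

# Without na.fill(0), the first row's sum would be null,
# because null + anything is null in Spark
df.na.fill(0).withColumn('result', reduce(add, [col(x) for x in df.columns])).show()
# result: 1 + 2 + 0 = 3 and 4 + 5 + 6 = 15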
Add multiple columns from a list into one column
I tried a lot of methods and the following are my observations:
- PySpark's sum
function doesn't support column addition (PySpark version 2.3.1).
- Python's built-in sum
function works for some folks but gives an error for others (a likely reason is sketched after this list).
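The inconsistency is most likely name shadowing: a star import from pyspark.sql.functions replaces Python's built-in sum with PySpark's aggregate sum, which cannot fold a list of Columns. A sketch of the failure mode, using made-up data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ['a', 'b', 'c'])  # hypothetical data

# Works: Python's built-in sum folds the Columns with +
df.withColumn('total', sum(df[c] for c in df.columns))

# After a star import, `sum` refers to PySpark's aggregate function,
# and the same line now raises an error about the argument type
from pyspark.sql.functions import *
df.withColumn('total', sum(df[c] for c in df.columns))

# The built-in is still reachable explicitly
import builtins
df.withColumn('total', builtins.sum(df[c] for c in df.columns))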
So, the addition of multiple columns can instead be achieved with the expr
function in PySpark, which takes a SQL expression string to be computed as input.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of the columns. We can also pass any other complex expression to produce other outputs.
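A runnable sketch of the same idea (the data is made up for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data
df = spark.createDataFrame([(1, 2, 3), (4, 5, 6)], ['a', 'b', 'c'])

cols_list = ['a', 'b', 'c']
expression = '+'.join(cols_list)  # 'a+b+c'
df = df.withColumn('sum_cols', expr(expression))

# Any SQL expression string works the same way, e.g. a weighted sum
df = df.withColumn('weighted', expr('a + 2*b + 3*c'))
df.show()  # sum_cols: 6, 15; weighted: 14, 32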