How to sum the values of one column of a DataFrame in Spark/Scala
If you want to sum all values of one column, it's more efficient to use the DataFrame's internal RDD and reduce.
import sqlContext.implicits._
import org.apache.spark.sql.functions._

// Build a one-column DataFrame of Ints
val df = sc.parallelize(Array(10, 2, 3, 4)).toDF("steps")

// Drop to the underlying RDD[Row], extract each Int, and sum with reduce
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_ + _)
// res1: Int = 19
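If you'd rather avoid the asInstanceOf cast, the same sum can be written against a typed Dataset. A minimal sketch, assuming Spark 2.x with a SparkSession in scope as spark:

import spark.implicits._

// Encode the column as a Dataset[Int] and reduce it directly, no Row casts
val total: Int = df.select("steps").as[Int].reduce(_ + _)
// total: Int = 19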
Alternatively, staying in the DataFrame API, you must first import the aggregate functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0)) // CSV.load: a custom loader; any DataFrame works here
val sumSteps = df.agg(sum("steps")).first.get(0)
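With the built-in CSV reader, the equivalent load-and-sum looks like this; a sketch, assuming Spark 2.x, a header row, and a placeholder file path:

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true") // infer numeric types so sum() applies
  .csv("steps.csv")              // placeholder path

val sumSteps = df.agg(sum("steps")).first.get(0)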
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
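If you'd rather skip the Row getters entirely, the one-column result can also be read back as a typed Dataset; a sketch, assuming the implicits import shown earlier:

// The agg result is a single long column, so it encodes as Dataset[Long]
val sumSteps: Long = df.agg(sum("steps").cast("long")).as[Long].first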
Edit:
For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
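sums is a single Row, so the individual totals come out by position or by alias; for example, assuming the summed columns are long-typed:

val sumCol1 = sums.getLong(0)              // by position
val sumCol2 = sums.getAs[Long]("sum_col2") // by the alias given above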
Edit2:
To apply the aggregations dynamically, the following options are available:
- Applying to all numeric columns at once:
df.groupBy().sum()
- Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
- Applying to a list of numeric column names with aliases and/or casts (an end-to-end sketch follows the list):
val cols = List("col1", "col2")

// One aggregate Column per name, with a cast and a readable alias
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))

// agg takes (Column, Column*), hence the head/tail split
df.groupBy().agg(sums.head, sums.tail: _*).show()
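Putting the last option together end to end; a minimal runnable sketch, assuming Spark 2.x with a SparkSession named spark and toy data:

import org.apache.spark.sql.functions._
import spark.implicits._

// Toy two-column DataFrame
val df = Seq((1, 10.0), (2, 20.0), (3, 30.0)).toDF("col1", "col2")

val cols = List("col1", "col2")
val sums = cols.map(c => sum(c).cast("double").as("sum_" + c))

val row = df.groupBy().agg(sums.head, sums.tail: _*).first

// Read each aliased sum back out of the single result Row
cols.zipWithIndex.foreach { case (c, i) =>
  println(s"sum_$c = ${row.getDouble(i)}")
}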