Merging multiple rows in a Spark DataFrame into a single row
You can group the rows into one-minute buckets and aggregate each bucket. With data such as:
val df = sc.parallelize(Seq(
(1441637160, 10.0),
(1441637170, 20.0),
(1441637180, 30.0),
(1441637210, 40.0),
(1441637220, 10.0),
(1441637230, 0.0))).toDF("timestamp", "value")
import the required functions and classes (mean is needed for the aggregation below):
import org.apache.spark.sql.functions.{floor, lit, mean}
import org.apache.spark.sql.types.IntegerType
create the interval column, which truncates each timestamp to the start of its 60-second interval:
val tsGroup = (floor($"timestamp" / lit(60)) * lit(60))
.cast(IntegerType)
.alias("timestamp")
and use it as the grouping key for the aggregation:
df.groupBy(tsGroup).agg(mean($"value").alias("value")).show
// +----------+-----+
// | timestamp|value|
// +----------+-----+
// |1441637160| 25.0|
// |1441637220|  5.0|
// +----------+-----+
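The bucket arithmetic can be checked without a cluster. The following is a minimal sketch in plain Scala collections, not Spark code: the object name MinuteBuckets and the helper minuteBucket are hypothetical names introduced here for illustration. It assumes non-negative timestamps, for which integer division by 60 matches floor.

```scala
object MinuteBuckets {
  // Hypothetical helper (not a Spark API): start of the 60-second
  // interval containing ts. Integer division equals floor for ts >= 0.
  def minuteBucket(ts: Int): Int = (ts / 60) * 60

  // Same sample rows as the DataFrame above.
  val rows: Seq[(Int, Double)] = Seq(
    (1441637160, 10.0),
    (1441637170, 20.0),
    (1441637180, 30.0),
    (1441637210, 40.0),
    (1441637220, 10.0),
    (1441637230, 0.0))

  // Group rows by bucket and average the values in each group,
  // mirroring groupBy(tsGroup).agg(mean(...)) on local data.
  val averages: Seq[(Int, Double)] = rows
    .groupBy { case (ts, _) => minuteBucket(ts) }
    .map { case (bucket, group) => (bucket, group.map(_._2).sum / group.size) }
    .toSeq
    .sortBy(_._1)

  def main(args: Array[String]): Unit = averages.foreach(println)
}
```

Note that 1441637210 still falls in the bucket starting at 1441637160, because the next bucket only begins at 1441637220. That is why the first bucket averages four values and the second averages two.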
Alternatively, with an RDD of (timestamp, value) pairs (called rdd here), first map each timestamp to its minute bucket, then use groupByKey to calculate the averages. For example:
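// Truncate each timestamp to the start of its minute
rdd.map { case (ts, value) => (ts - ts % 60, value) }
// Collect all values that share a bucket
.groupByKey
// Average the values of each bucket
.map { case (bucket, values) => (bucket, values.sum.toDouble / values.size) }
.collect()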
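Averaging does not require holding every value of a bucket at once: a running (sum, count) pair per bucket is enough. The sketch below shows that idea in plain Scala collections, with the hypothetical object name RunningAverage. On an RDD, the same pairwise combine is the kind of function one would pass to reduceByKey, which is generally preferred over groupByKey for large groups. That mapping to Spark is a suggestion, not something the code below runs.

```scala
object RunningAverage {
  // Same sample data as above, as (timestamp, value) pairs.
  val rows: Seq[(Int, Double)] = Seq(
    (1441637160, 10.0),
    (1441637170, 20.0),
    (1441637180, 30.0),
    (1441637210, 40.0),
    (1441637220, 10.0),
    (1441637230, 0.0))

  // Fold the rows into a map of bucket -> (sum, count),
  // updating one running pair per bucket.
  val totals: Map[Int, (Double, Long)] =
    rows.foldLeft(Map.empty[Int, (Double, Long)]) { case (acc, (ts, value)) =>
      val bucket = ts - ts % 60
      val (sum, count) = acc.getOrElse(bucket, (0.0, 0L))
      acc.updated(bucket, (sum + value, count + 1))
    }

  // Divide each sum by its count to get the per-bucket average.
  val averages: Map[Int, Double] =
    totals.map { case (bucket, (sum, count)) => bucket -> (sum / count) }

  def main(args: Array[String]): Unit =
    averages.toSeq.sortBy(_._1).foreach(println)
}
```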