Spark: how to translate count(distinct(value)) into the DataFrame API
What you need is the DataFrame aggregation function countDistinct:
import sqlContext.implicits._
import org.apache.spark.sql.functions._

case class Log(page: String, visitor: String)

// data is an RDD of (page, visitor) pairs
val logs = data.map(p => Log(p._1, p._2)).toDF()

// Group by page, then count the distinct visitors in each group.
// Since Spark 1.4 the grouping column is included in the result
// automatically, so 'page must not be repeated inside agg.
val result = logs
  .groupBy('page)
  .agg(countDistinct('visitor).as("distinct_visitors"))

result.show()
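To see what countDistinct computes without a running Spark cluster, the same per-page logic can be sketched on plain Scala collections. This is only an illustration of the semantics; the function name and sample (page, visitor) pairs below are hypothetical.

```scala
// Sketch of countDistinct semantics on plain Scala collections:
// for each page, count the number of unique visitors.
def distinctVisitorsPerPage(logs: Seq[(String, String)]): Map[String, Int] =
  logs
    .groupBy(_._1)                            // like groupBy('page)
    .map { case (page, rows) =>
      page -> rows.map(_._2).distinct.size    // like countDistinct('visitor)
    }

// Hypothetical sample data
val sample = Seq(
  ("PAG1", "V1"), ("PAG1", "V1"), ("PAG1", "V2"),
  ("PAG2", "V1"), ("PAG2", "V2"), ("PAG2", "V2")
)

distinctVisitorsPerPage(sample).toSeq.sorted.foreach(println)
// (PAG1,2)
// (PAG2,2)
```

Note that duplicate visits (V1 appears twice on PAG1) are counted only once, which is exactly the difference between count and countDistinct.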
You can use the DataFrame groupBy command twice to do so. Here, df1 is your original input.
val df2 = df1.groupBy($"page",$"visitor").agg(count($"visitor").as("count"))
This command produces the following intermediate result:

page visitor count
---- ------- -----
PAG2 V2      2
PAG1 V3      1
PAG1 V1      5
PAG1 V2      2
PAG2 V1      2
Then use the groupBy command again, this time on page alone. Because df2 has exactly one row per (page, visitor) pair, counting its rows per page yields the number of distinct visitors.
df2.groupBy($"page").agg(count($"visitor").as("count"))
Final output:
page count
---- -----
PAG1 3
PAG2 2
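To see why the two-step groupBy is equivalent to a distinct count, the same logic can be mimicked on plain Scala collections. This is only a sketch; the function name and sample data below are hypothetical, chosen to match the tables above.

```scala
// Step 1 mirrors df1.groupBy($"page", $"visitor").agg(count(...)):
// collapse duplicates to one row per (page, visitor) pair.
// Step 2 mirrors df2.groupBy($"page").agg(count(...)):
// count the remaining rows per page.
def countDistinctViaTwoGroupBys(visits: Seq[(String, String)]): Map[String, Int] = {
  val perPair: Map[(String, String), Int] =
    visits.groupBy(identity).map { case (pair, rows) => pair -> rows.size }
  perPair.keys.toSeq
    .groupBy(_._1)
    .map { case (page, pairs) => page -> pairs.size }
}

// Visits matching the tables above: PAG1 has visitors V1 (5x), V2 (2x),
// V3 (1x); PAG2 has V1 (2x) and V2 (2x).
val visits =
  Seq.fill(5)(("PAG1", "V1")) ++ Seq.fill(2)(("PAG1", "V2")) ++
  Seq(("PAG1", "V3")) ++
  Seq.fill(2)(("PAG2", "V1")) ++ Seq.fill(2)(("PAG2", "V2"))

countDistinctViaTwoGroupBys(visits).toSeq.sorted.foreach(println)
// (PAG1,3)
// (PAG2,2)
```

The intermediate per-pair counts (the "count" column of the first table) are discarded in step 2; only the fact that each pair has been collapsed to a single row matters for the distinct count.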