Spark: how to translate count(distinct(value)) into the DataFrame API
What you need is the DataFrame aggregation function countDistinct:
import sqlContext.implicits._
import org.apache.spark.sql.functions._

case class Log(page: String, visitor: String)

// data is an RDD of (page, visitor) pairs
val logs = data.map(p => Log(p._1, p._2)).toDF()

// Group by page, then count the distinct visitors in each group.
// Since Spark 1.4 the grouping column is included in the result
// automatically, so 'page must not be repeated inside agg.
val result = logs
  .groupBy('page)
  .agg(countDistinct('visitor).as("distinct_visitors"))

result.show()
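To see what countDistinct computes without a running Spark cluster, the same per-page logic can be sketched on plain Scala collections. This is only an illustration of the semantics; the function name and sample (page, visitor) pairs below are hypothetical.

```scala
// Sketch of countDistinct semantics on plain Scala collections:
// for each page, count the number of unique visitors.
def distinctVisitorsPerPage(logs: Seq[(String, String)]): Map[String, Int] =
  logs
    .groupBy(_._1)                            // like groupBy('page)
    .map { case (page, rows) =>
      page -> rows.map(_._2).distinct.size    // like countDistinct('visitor)
    }

// Hypothetical sample data
val sample = Seq(
  ("PAG1", "V1"), ("PAG1", "V1"), ("PAG1", "V2"),
  ("PAG2", "V1"), ("PAG2", "V2"), ("PAG2", "V2")
)

distinctVisitorsPerPage(sample).toSeq.sorted.foreach(println)
// (PAG1,2)
// (PAG2,2)
```

Note that duplicate visits (V1 appears twice on PAG1) are counted only once, which is exactly the difference between count and countDistinct.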
You can use the DataFrame groupBy command twice to do so. Here, df1 is your original input.
val df2 = df1.groupBy($"page",$"visitor").agg(count($"visitor").as("count"))
This command produces the following intermediate result:

page visitor count
---- ------- -----
PAG2 V2      2
PAG1 V3      1
PAG1 V1      5
PAG1 V2      2
PAG2 V1      2
Then use the groupBy command again, this time on page alone. Because df2 has exactly one row per (page, visitor) pair, counting its rows per page yields the number of distinct visitors.
df2.groupBy($"page").agg(count($"visitor").as("count"))
Final output:
page count
---- -----
PAG1 3
PAG2 2
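To see why the two-step groupBy is equivalent to a distinct count, the same logic can be mimicked on plain Scala collections. This is only a sketch; the function name and sample data below are hypothetical, chosen to match the tables above.

```scala
// Step 1 mirrors df1.groupBy($"page", $"visitor").agg(count(...)):
// collapse duplicates to one row per (page, visitor) pair.
// Step 2 mirrors df2.groupBy($"page").agg(count(...)):
// count the remaining rows per page.
def countDistinctViaTwoGroupBys(visits: Seq[(String, String)]): Map[String, Int] = {
  val perPair: Map[(String, String), Int] =
    visits.groupBy(identity).map { case (pair, rows) => pair -> rows.size }
  perPair.keys.toSeq
    .groupBy(_._1)
    .map { case (page, pairs) => page -> pairs.size }
}

// Visits matching the tables above: PAG1 has visitors V1 (5x), V2 (2x),
// V3 (1x); PAG2 has V1 (2x) and V2 (2x).
val visits =
  Seq.fill(5)(("PAG1", "V1")) ++ Seq.fill(2)(("PAG1", "V2")) ++
  Seq(("PAG1", "V3")) ++
  Seq.fill(2)(("PAG2", "V1")) ++ Seq.fill(2)(("PAG2", "V2"))

countDistinctViaTwoGroupBys(visits).toSeq.sorted.foreach(println)
// (PAG1,3)
// (PAG2,2)
```

The intermediate per-pair counts (the "count" column of the first table) are discarded in step 2; only the fact that each pair has been collapsed to a single row matters for the distinct count.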