In Spark Dataframe how to get duplicate records and distinct records in two dataframes?
Depending on the version of spark you have, you could use window functions in datasets/sql like below:
Dataset<Row> New = df.withColumn("Duplicate", count("*").over( Window.partitionBy("id") ) );
Dataset<Row> Dups = New.filter(col("Duplicate").gt(1));
Dataset<Row> Uniques = New.filter(col("Duplicate").equalTo(1));
the above is written in java. should be similar in scala and read this on how to do in python. https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
val acctDF = List(("1", "Acc1"), ("1", "Acc1"), ("1", "Acc1"), ("2", "Acc2"), ("2", "Acc2"), ("3", "Acc3")).toDF("AcctId", "Details")
scala> acctDF.show()
+------+-------+
|AcctId|Details|
+------+-------+
| 1| Acc1|
| 1| Acc1|
| 1| Acc1|
| 2| Acc2|
| 2| Acc2|
| 3| Acc3|
+------+-------+
val countsDF = acctDF.map(rec => (rec(0), 1)).reduceByKey(_+_).map(rec=> (rec._1.toString, rec._2)).toDF("AcctId", "AcctCount")
val accJoinedDF = acctDF.join(countsDF, acctDF("AcctId")===countsDF("AcctId"), "left_outer").select(acctDF("AcctId"), acctDF("Details"), countsDF("AcctCount"))
scala> accJoinedDF.show()
+------+-------+---------+
|AcctId|Details|AcctCount|
+------+-------+---------+
| 1| Acc1| 3|
| 1| Acc1| 3|
| 1| Acc1| 3|
| 2| Acc2| 2|
| 2| Acc2| 2|
| 3| Acc3| 1|
+------+-------+---------+
val distAcctDF = accJoinedDF.filter($"AcctCount"===1)
scala> distAcctDF.show()
+------+-------+---------+
|AcctId|Details|AcctCount|
+------+-------+---------+
| 3| Acc3| 1|
+------+-------+---------+
val duplAcctDF = accJoinedDF.filter($"AcctCount">1)
scala> duplAcctDF.show()
+------+-------+---------+
|AcctId|Details|AcctCount|
+------+-------+---------+
| 1| Acc1| 3|
| 1| Acc1| 3|
| 1| Acc1| 3|
| 2| Acc2| 2|
| 2| Acc2| 2|
+------+-------+---------+
(OR scala> duplAcctDF.distinct.show() )
df.groupBy($"field1",$"field2"...).count.filter($"count" > 1).show()