Joining two DataFrames in Spark SQL and selecting columns of only one
Let's say you want to join on the "id" column. Then you could write:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select($"d1.*")
You could also use left_semi:
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_semi")
A semi-join returns only the rows from the left dataset for which the join condition is met, and the result contains only the left dataset's columns, so no extra select is needed.
There's also another interesting join type: left_anti, which works similarly to left_semi but returns only the rows for which the condition is not met, i.e. the left rows that have no match on the right.
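For instance, with the same hypothetical d1 and d2 as above, an anti-join would return the d1 rows whose id does not appear in d2:
// Anti-join: keeps only the d1 rows with no matching id in d2 (here, id 2)
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_anti")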