Joining two DataFrames in Spark SQL and selecting columns of only one
Let's say you want to join on the "id" column. Then you could write:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select($"d1.*")
You could also use left_semi:
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_semi")
A semi-join returns only the rows from the left dataset for which the join condition is met, and the result contains only the left dataset's columns, so no extra select is needed.
There's also another interesting join type: left_anti, which works similarly to left_semi but returns only the rows for which the condition is not met, i.e. the left rows that have no match on the right.
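For instance, with the same hypothetical d1 and d2 as above, an anti-join would return the d1 rows whose id does not appear in d2:
// Anti-join: keeps only the d1 rows with no matching id in d2 (here, id 2)
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left_anti")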