Extracting `Seq[(String,String,String)]` from spark DataFrame
Well, it doesn't claim that it is a tuple. It claims it is a struct
which maps to Row
:
import org.apache.spark.sql.Row
case class Feature(lemma: String, pos_tag: String, ne_tag: String)
case class Record(id: Long, content_processed: Seq[Feature])
val df = Seq(
Record(1L, Seq(
Feature("ancient", "jj", "o"),
Feature("olympia_greece", "nn", "location")
))
).toDF
val content = df.select($"content_processed").rdd.map(_.getSeq[Row](0))
You'll find exact mapping rules in the Spark SQL programming guide.
Since Row
is not exactly pretty structure you'll probably want to map it to something useful:
content.map(_.map {
case Row(lemma: String, pos_tag: String, ne_tag: String) =>
(lemma, pos_tag, ne_tag)
})
or:
content.map(_.map ( row => (
row.getAs[String]("lemma"),
row.getAs[String]("pos_tag"),
row.getAs[String]("ne_tag")
)))
Finally a slightly more concise approach with Datasets
:
df.as[Record].rdd.map(_.content_processed)
or
df.select($"content_processed").as[Seq[(String, String, String)]]
although this seems to be slightly buggy at this moment.
There is important difference the first approach (Row.getAs
) and the second one (Dataset.as
). The former one extract objects as Any
and applies asInstanceOf
. The latter one is using encoders to transform between internal types and desired representation.