Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, ..., fn: Double)]

Spark >= 3.0.0

Since Spark 3.0 you can use vector_to_array

import org.apache.spark.ml.functions.vector_to_array

testDF.select(vector_to_array($"scaledFeatures").alias("_tmp")).select(exprs:_*)

Spark < 3.0.0

One possible approach is something similar to this

import org.apache.spark.sql.functions.udf

// In Spark 1.x you'll will have to replace ML Vector with MLLib one
// import org.apache.spark.mllib.linalg.Vector
// In 2.x the below is usually the right choice
import org.apache.spark.ml.linalg.Vector

// Get size of the vector
val n = testDF.first.getAs[Vector](0).size

// Simple helper to convert vector to array<double> 
// asNondeterministic is available in Spark 2.3 or befor
// It can be removed, but at the cost of decreased performance
val vecToSeq = udf((v: Vector) => v.toArray).asNondeterministic

// Prepare a list of columns to create
val exprs = (0 until n).map(i => $"_tmp".getItem(i).alias(s"f$i"))

testDF.select(vecToSeq($"scaledFeatures").alias("_tmp")).select(exprs:_*)

If you know a list of columns upfront you can simplify this a little:

val cols: Seq[String] = ???
val exprs = cols.zipWithIndex.map{ case (c, i) => $"_tmp".getItem(i).alias(c) }

For Python equivalent see How to split Vector into columns - using PySpark.

Alternate solution that evovled couple of days ago: Import the VectorDisassembler into your project (as long as it's not merged into Spark), now:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

val dataset = spark.createDataFrame(
  Seq((0, 1.2, 1.3), (1, 2.2, 2.3), (2, 3.2, 3.3))
).toDF("id", "val1", "val2")


val assembler = new VectorAssembler()
  .setInputCols(Array("val1", "val2"))
  .setOutputCol("vectorCol")

val output = assembler.transform(dataset)
output.show()
/*
+---+----+----+---------+
| id|val1|val2|vectorCol|
+---+----+----+---------+
|  0| 1.2| 1.3|[1.2,1.3]|
|  1| 2.2| 2.3|[2.2,2.3]|
|  2| 3.2| 3.3|[3.2,3.3]|
+---+----+----+---------+*/

val disassembler = new org.apache.spark.ml.feature.VectorDisassembler()
  .setInputCol("vectorCol")
disassembler.transform(output).show()
/*
+---+----+----+---------+----+----+
| id|val1|val2|vectorCol|val1|val2|
+---+----+----+---------+----+----+
|  0| 1.2| 1.3|[1.2,1.3]| 1.2| 1.3|
|  1| 2.2| 2.3|[2.2,2.3]| 2.2| 2.3|
|  2| 3.2| 3.3|[3.2,3.3]| 3.2| 3.3|
+---+----+----+---------+----+----+*/

Please try VectorSlicer :

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

val dataset = spark.createDataFrame(
  Seq((1, 0.2, 0.8), (2, 0.1, 0.9), (3, 0.3, 0.7))
).toDF("id", "negative_logit", "positive_logit")


val assembler = new VectorAssembler()
  .setInputCols(Array("negative_logit", "positive_logit"))
  .setOutputCol("prediction")

val output = assembler.transform(dataset)
output.show()
/*
+---+--------------+--------------+----------+
| id|negative_logit|positive_logit|prediction|
+---+--------------+--------------+----------+
|  1|           0.2|           0.8| [0.2,0.8]|
|  2|           0.1|           0.9| [0.1,0.9]|
|  3|           0.3|           0.7| [0.3,0.7]|
+---+--------------+--------------+----------+
*/

val slicer = new VectorSlicer()
.setInputCol("prediction")
.setIndices(Array(1))
.setOutputCol("positive_prediction")

val posi_output = slicer.transform(output)
posi_output.show()

/*
+---+--------------+--------------+----------+-------------------+
| id|negative_logit|positive_logit|prediction|positive_prediction|
+---+--------------+--------------+----------+-------------------+
|  1|           0.2|           0.8| [0.2,0.8]|              [0.8]|
|  2|           0.1|           0.9| [0.1,0.9]|              [0.9]|
|  3|           0.3|           0.7| [0.3,0.7]|              [0.7]|
+---+--------------+--------------+----------+-------------------+
*/

Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, ..., fn: Double)]

Tags:

Scala

Apache Spark

Apache Spark Sql

Apache Spark Ml

Related

Recent Posts