How to concatenate multiple columns into single column (with no prior knowledge on their number)?
TL;DR Use struct
function with Dataset.columns
operator.
Quoting the scaladoc of struct function:
struct(colName: String, colNames: String*): Column Creates a new struct column that composes multiple input columns.
There are two variants: string-based for column names or using Column
expressions (that gives you more flexibility on the calculation you want to apply on the concatenated columns).
From Dataset.columns:
columns: Array[String] Returns all column names as an array.
Your case would then look as follows:
scala> df.withColumn("newCol",
struct(df.columns.head, df.columns.tail: _*)).
show(false)
+----------+-----------+---------+-------+----+--------------------------+
|agentName |original_dt|parsed_dt|user |text|newCol |
+----------+-----------+---------+-------+----+--------------------------+
|qwertyuiop|0 |0 |16102.0|0 |[qwertyuiop,0,0,16102.0,0]|
+----------+-----------+---------+-------+----+--------------------------+
I think this works perfect for your case here is with an example
val spark =
SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._
val data = spark.sparkContext.parallelize(
Seq(
("qwertyuiop", 0, 0, 16102.0, 0)
)
).toDF("agentName","original_dt","parsed_dt","user","text")
val result = data.withColumn("newCol", split(concat_ws(";", data.schema.fieldNames.map(c=> col(c)):_*), ";"))
result.show()
+----------+-----------+---------+-------+----+------------------------------+
|agentName |original_dt|parsed_dt|user |text|newCol |
+----------+-----------+---------+-------+----+------------------------------+
|qwertyuiop|0 |0 |16102.0|0 |[qwertyuiop, 0, 0, 16102.0, 0]|
+----------+-----------+---------+-------+----+------------------------------+
Hope this helped!