Split 1 column into 3 columns in spark scala
To do this programmatically, you can create a sequence of expressions with (0 until 3).map(i => col("temp").getItem(i).as(s"col$i"))
(assume you need 3 columns as result) and then apply it to select
with : _*
syntax:
df.withColumn("temp", split(col("columnToSplit"), "\\.")).select(
(0 until 3).map(i => col("temp").getItem(i).as(s"col$i")): _*
).show
+----+----+----+
|col0|col1|col2|
+----+----+----+
| a| b| c|
| d| e| f|
+----+----+----+
To keep all columns:
df.withColumn("temp", split(col("columnToSplit"), "\\.")).select(
col("*") +: (0 until 3).map(i => col("temp").getItem(i).as(s"col$i")): _*
).show
+-------------+---------+----+----+----+
|columnToSplit| temp|col0|col1|col2|
+-------------+---------+----+----+----+
| a.b.c|[a, b, c]| a| b| c|
| d.e.f|[d, e, f]| d| e| f|
+-------------+---------+----+----+----+
If you are using pyspark
, use a list comprehension to replace the map
in scala:
df = spark.createDataFrame([['a.b.c'], ['d.e.f']], ['columnToSplit'])
from pyspark.sql.functions import col, split
(df.withColumn('temp', split('columnToSplit', '\\.'))
.select(*(col('temp').getItem(i).alias(f'col{i}') for i in range(3))
).show()
+----+----+----+
|col0|col1|col2|
+----+----+----+
| a| b| c|
| d| e| f|
+----+----+----+
Try:
import sparkObject.spark.implicits._
import org.apache.spark.sql.functions.split
df.withColumn("_tmp", split($"columnToSplit", "\\.")).select(
$"_tmp".getItem(0).as("col1"),
$"_tmp".getItem(1).as("col2"),
$"_tmp".getItem(2).as("col3")
)
The important point to note here is that the sparkObject
is the SparkSession object you might have already initialized. So, the (1) import statement has to be compulsorily put inline within the code, not before the class definition.