How to split a list into multiple columns in PySpark?
It depends on the type of your "list":
If it is of type ArrayType():

df = hc.createDataFrame(sc.parallelize([['a', [1,2,3]], ['b', [2,3,4]]]), ["key", "value"])
df.printSchema()
df.show()

root
 |-- key: string (nullable = true)
 |-- value: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-------+
|key|  value|
+---+-------+
|  a|[1,2,3]|
|  b|[2,3,4]|
+---+-------+
You can access the values like you would with Python, using []:

df.select("key", df.value[0], df.value[1], df.value[2]).show()

+---+--------+--------+--------+
|key|value[0]|value[1]|value[2]|
+---+--------+--------+--------+
|  a|       1|       2|       3|
|  b|       2|       3|       4|
+---+--------+--------+--------+
If it is of type StructType() (maybe you built your dataframe by reading a JSON):

import pyspark.sql.functions as psf

df2 = df.select("key", psf.struct(
        df.value[0].alias("value1"),
        df.value[1].alias("value2"),
        df.value[2].alias("value3")
    ).alias("value"))
df2.printSchema()
df2.show()

root
 |-- key: string (nullable = true)
 |-- value: struct (nullable = false)
 |    |-- value1: long (nullable = true)
 |    |-- value2: long (nullable = true)
 |    |-- value3: long (nullable = true)

+---+-------+
|key|  value|
+---+-------+
|  a|[1,2,3]|
|  b|[2,3,4]|
+---+-------+
You can directly "split" the column using *:

df2.select('key', 'value.*').show()

+---+------+------+------+
|key|value1|value2|value3|
+---+------+------+------+
|  a|     1|     2|     3|
|  b|     2|     3|     4|
+---+------+------+------+
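A case the above doesn't cover: arrays whose length varies from row to row. A hedged sketch reusing df from above; indexing past the end of a shorter array just yields null:

import pyspark.sql.functions as psf

# Sketch, not from the original answer: size the column list from the data
# by taking the length of the longest array in the column.
max_len = df.select(psf.max(psf.size("value"))).first()[0]
df.select("key", *[df.value[i].alias(f"value{i+1}") for i in range(max_len)]).show()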
I'd like to add the case of fixed-size lists (arrays) to pault's answer.

When the column contains medium-sized arrays (or large ones), it is still possible to split them into columns.
from pyspark.sql.types import *  # Needed to define the DataFrame schema.
from pyspark.sql.functions import expr

# Define a schema to create a DataFrame with an array-typed column.
mySchema = StructType([StructField("V1", StringType(), True),
                       StructField("V2", ArrayType(IntegerType(), True))])

df = spark.createDataFrame([['A', [1, 2, 3, 4, 5, 6, 7]],
                            ['B', [8, 7, 6, 5, 4, 3, 2]]], schema=mySchema)

# Split the list into columns using 'expr()' in a list comprehension.
arr_size = 7
df = df.select(['V1', 'V2'] + [expr('V2[' + str(x) + ']') for x in range(arr_size)])

# It is possible to define new column names.
new_colnames = ['V1', 'V2'] + ['val_' + str(i) for i in range(arr_size)]
df = df.toDF(*new_colnames)
The result is:
df.show(truncate= False)
+---+---------------------+-----+-----+-----+-----+-----+-----+-----+
|V1 |V2 |val_0|val_1|val_2|val_3|val_4|val_5|val_6|
+---+---------------------+-----+-----+-----+-----+-----+-----+-----+
|A |[1, 2, 3, 4, 5, 6, 7]|1 |2 |3 |4 |5 |6 |7 |
|B |[8, 7, 6, 5, 4, 3, 2]|8 |7 |6 |5 |4 |3 |2 |
+---+---------------------+-----+-----+-----+-----+-----+-----+-----+
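For completeness, here is a hedged sketch of an alternative neither answer uses: posexplode pairs every array element with its position, and pivoting the positions back into columns avoids hard-coding arr_size entirely (at the cost of the shuffle a pivot implies). It reuses df from above, where V1 and V2 are still present after the rename:

from pyspark.sql.functions import posexplode, first

# Explode each element together with its index (default column names are
# 'pos' and 'col'), then pivot the index back into one column per position.
pivoted = (df.select("V1", posexplode("V2"))
             .groupBy("V1")
             .pivot("pos")
             .agg(first("col")))
pivoted.show()

The resulting columns are named after the positions ('0' through '6'); a toDF(*new_colnames) like the one above can rename them.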