How do I order fields of my Row objects in Spark (Python)
Spark >= 3.0
Field sorting was removed in SPARK-29748 (Remove sorting of fields in PySpark SQL Row creation), except in legacy mode, which is enabled by setting the following environment variable:
PYSPARK_ROW_FIELD_SORTING_ENABLED=true
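So on 3.0 and later (with the legacy flag unset) the fields keep the order in which you declare them; a quick check, assuming an active SparkSession bound to spark as in the examples below:
from pyspark.sql import Row

spark.createDataFrame([Row(foo=1, bar=2)]).dtypes
# [('foo', 'bigint'), ('bar', 'bigint')]  <- declared order preserved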
Spark < 3.0
But is there any way to prevent the Row object from ordering them?
There isn't. If you provide kwargs, the arguments will be sorted by name. Sorting is required for deterministic behavior, because Python before 3.6 doesn't preserve the order of keyword arguments.
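For example, on Spark < 3.0 the fields come back alphabetically sorted no matter how you write them:
from pyspark.sql import Row

Row(foo=1, bar=2)
# Row(bar=2, foo=1)  <- sorted by name, not by declaration order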
Just use plain tuples:
rdd = sc.parallelize([(1, 2)])
and pass the schema as an argument to RDD.toDF (not to be confused with DataFrame.toDF):
rdd.toDF(["foo", "bar"])
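A quick check that the names land in the order you passed (the types are inferred as bigint from the Python ints):
rdd.toDF(["foo", "bar"]).dtypes
# [('foo', 'bigint'), ('bar', 'bigint')]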
or createDataFrame:
from pyspark.sql.types import StructType, StructField, IntegerType

spark.createDataFrame(rdd, ["foo", "bar"])

# With full schema
schema = StructType([
    StructField("foo", IntegerType(), False),
    StructField("bar", IntegerType(), False)])

spark.createDataFrame(rdd, schema)
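With the explicit schema the declared types and nullability are applied as well:
spark.createDataFrame(rdd, schema).printSchema()
# root
#  |-- foo: integer (nullable = false)
#  |-- bar: integer (nullable = false)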
You can also use namedtuples:
from collections import namedtuple
FooBar = namedtuple("FooBar", ["foo", "bar"])
spark.createDataFrame([FooBar(foo=1, bar=2)])
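Because namedtuple fields keep their declaration order, the resulting columns do too:
spark.createDataFrame([FooBar(foo=1, bar=2)]).dtypes
# [('foo', 'bigint'), ('bar', 'bigint')]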
Finally, you can reorder columns with select:
from pyspark.sql import Row

sc.parallelize([Row(foo=1, bar=2)]).toDF().select("foo", "bar")
From the documentation:
Row also can be used to create another Row like class, then it could be used to create Row objects
In this case the order of the columns is preserved:
>>> FooRow = Row('foo', 'bar')
>>> row = FooRow(1, 2)
>>> spark.createDataFrame([row]).dtypes
[('foo', 'bigint'), ('bar', 'bigint')]
To sort your original schema to match the alphabetical field order of the RDD's Rows:
from pyspark.sql.types import StructType

# Rebuild the schema with its fields sorted alphabetically by name
schema_sorted = StructType()
structfield_list_sorted = sorted(df.schema, key=lambda x: x.name)
for item in structfield_list_sorted:
    schema_sorted.add(item)
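The sorted schema can then be passed wherever a schema is expected; for example (a sketch, where rdd_of_rows is a hypothetical RDD of kwargs-built Rows, whose fields Spark < 3.0 sorts alphabetically):
# rdd_of_rows: hypothetical RDD of Rows built with kwargs,
# so their fields are already in alphabetical order
spark.createDataFrame(rdd_of_rows, schema_sorted)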