Building a row from a dict in pySpark

If the dict is not flat (it contains nested dicts or lists of dicts), you can convert it to a Row recursively.

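from pyspark.sql import Row

def as_row(obj):
    # Dicts become Rows; their values are converted recursively.
    if isinstance(obj, dict):
        dictionary = {k: as_row(v) for k, v in obj.items()}
        return Row(**dictionary)
    # Lists are converted element by element.
    elif isinstance(obj, list):
        return [as_row(v) for v in obj]
    # Anything else (scalars, None) is returned unchanged.
    else:
        return obj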

For a flat dict, you can use keyword argument unpacking:

Row(**row_dict)

## Row(C0=-1.1990072635132698, C3=0.12605772684660232, C4=0.5760856026559944, 
##     C5=0.1951877800894315, C6=24.72378589441825, summary='kurtosis')

Note that in Spark versions before 3.0, Row sorts the fields by name when it is built from keyword arguments. This worked around Python versions before 3.6, which did not preserve keyword argument order.

This sorting was removed in Spark 3.0 (see SPARK-29748, "Remove sorting of fields in PySpark SQL Row creation"). From that release on, fields keep the order in which they are passed, so you have to ensure that the order of keys in the dict is consistent across records.
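One way to keep the field order consistent is to rebuild every dict against a fixed list of field names before unpacking it into a Row. The following is a minimal sketch in plain Python; `ordered_fields` and `FIELDS` are hypothetical names chosen for this example, and filling missing keys with None is an assumption you may want to change.

```python
def ordered_fields(record, field_names):
    """Return a new dict whose keys follow field_names; missing keys become None."""
    # Relies on dicts preserving insertion order (Python 3.7+).
    return {name: record.get(name) for name in field_names}

# Hypothetical fixed field order shared by all records.
FIELDS = ["summary", "C0", "C3"]

# Two records with the same keys in different order.
records = [
    {"C0": 1.0, "summary": "mean", "C3": 2.0},
    {"C3": 0.5, "C0": -1.2, "summary": "kurtosis"},
]

normalized = [ordered_fields(r, FIELDS) for r in records]

# With PySpark available, each normalized dict can then be unpacked:
#   from pyspark.sql import Row
#   rows = [Row(**d) for d in normalized]
```

Every dict in `normalized` now lists its keys in the same order, so the resulting Rows line up field by field regardless of how the input dicts were built.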
