Convert a standard Python key-value dictionary list to a PySpark DataFrame
For anyone looking for the solution to something slightly different: I had a single dictionary of key-value pairs and wanted to convert it into a two-column PySpark DataFrame:
So

{k1: v1, k2: v2, ...}

becomes

+------+------+
| col1 | col2 |
+------+------+
| k1   | v1   |
| k2   | v2   |
+------+------+
# "lol" is a list of lists: [[k1, v1], [k2, v2], ...]
lol = list(map(list, mydict.items()))
df = spark.createDataFrame(lol, ["col1", "col2"])
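If you want explicit column types rather than letting Spark infer them, here is a minimal sketch (assuming an active SparkSession named spark and string keys and values in mydict; the schema below is illustrative):

from pyspark.sql.types import StructType, StructField, StringType

# Illustrative schema; adjust the field types to match your values.
schema = StructType([
    StructField("col1", StringType(), nullable=False),
    StructField("col2", StringType(), nullable=True),
])
df = spark.createDataFrame(list(mydict.items()), schema)
df.show()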
The other answers work, but here's one more one-liner that works well with nested data. It may not be the most efficient, but if you're making a DataFrame from an in-memory dictionary, you're either working with small data sets like test data or using Spark wrong, so efficiency should really not be a concern:
import json

d = {"k1": "v1", "nested": {"k2": "v2"}}  # any JSON-compatible dict works here
df = spark.read.json(sc.parallelize([json.dumps(d)]))
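Nested keys come out as struct columns that you can address with dot notation. A quick check with the sample d above (the commented output is hand-written from how JSON schema inference typically behaves):

df.printSchema()
# root
#  |-- k1: string (nullable = true)
#  |-- nested: struct (nullable = true)
#  |    |-- k2: string (nullable = true)
df.select("nested.k2").show()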
Old way (this works, but it relies on inferring a schema from plain dicts, which newer Spark versions warn is deprecated):
sc.parallelize([{"arg1": "", "arg2": ""},
                {"arg1": "", "arg2": ""},
                {"arg1": "", "arg2": ""}]).toDF()
New way:
from pyspark.sql import Row
from collections import OrderedDict

def convert_to_row(d: dict) -> Row:
    # Sort the keys so every Row has the same field order,
    # keeping the inferred schema consistent across records.
    return Row(**OrderedDict(sorted(d.items())))

sc.parallelize([{"arg1": "", "arg2": ""},
                {"arg1": "", "arg2": ""},
                {"arg1": "", "arg2": ""}]) \
    .map(convert_to_row) \
    .toDF()
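On Spark 2.0+ you can usually skip the RDD round-trip entirely; a minimal sketch, assuming an active SparkSession named spark:

from pyspark.sql import Row

# Build the Rows directly and hand them to the session.
rows = [Row(arg1="", arg2=""), Row(arg1="", arg2="")]
df = spark.createDataFrame(rows)
df.show()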