PyArrow: Store list of dicts in parquet using nested types
According to this Jira issue, reading and writing nested Parquet data with a mix of struct and list nesting levels was implemented in PyArrow 2.0.0.
The following example demonstrates this functionality with a round trip: pandas data frame -> Parquet file -> pandas data frame. The PyArrow version used here is 3.0.0.
The initial pandas data frame has one field of type list of dicts and one entry:
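If you want to confirm that your installed version is recent enough, a quick check of the version string is sufficient:

import pyarrow as pa
# Nested round trips require 2.0.0 or later.
print(pa.__version__)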
field
0 [{'a': 1}, {'a': 2}]
Example code:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet

df = pd.DataFrame({'field': [[{'a': 1}, {'a': 2}]]})

# Explicit schema: 'field' is a list of structs, each struct
# holding a single int64 field 'a'.
schema = pa.schema(
    [pa.field('field', pa.list_(pa.struct([('a', pa.int64())])))])

# Round trip: data frame -> Arrow table -> Parquet file -> data frame.
table_write = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pyarrow.parquet.write_table(table_write, 'test.parquet')
table_read = pyarrow.parquet.read_table('test.parquet')
table_read.to_pandas()
The output data frame is the same as the input data frame, as expected:
field
0 [{'a': 1}, {'a': 2}]
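The same approach appears to extend to deeper mixes of nesting, for example a list of structs where one struct field is itself a list. The following is a minimal sketch along the same lines; the extra field 'b' and the file name test_nested.parquet are illustrative choices, not part of the original example:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet

# One row: a list containing a single struct whose field 'b' is itself a list.
df = pd.DataFrame({'field': [[{'a': 1, 'b': [1, 2]}]]})

# Schema with mixed nesting: list<struct<a: int64, b: list<int64>>>.
schema = pa.schema([
    pa.field('field', pa.list_(pa.struct([
        ('a', pa.int64()),
        ('b', pa.list_(pa.int64())),
    ])))
])

table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pyarrow.parquet.write_table(table, 'test_nested.parquet')
print(pyarrow.parquet.read_table('test_nested.parquet').to_pandas())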