Pandas DataFrame with categorical columns from a Parquet file using read_parquet?

This is fixed in Arrow 0.15. The following code now keeps the columns as categories (and performance is significantly faster):

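import pandas

# Build a small frame and cast every column to the category dtype.
df = pandas.DataFrame({'foo': list('aabbcc'),
                       'bar': list('xxxyyy')}).astype('category')

# Round-trip through Parquet, then inspect the column dtypes.
df.to_parquet('my_file.parquet')
df = pandas.read_parquet('my_file.parquet')
print(df.dtypes)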

We had a similar problem. When working with a multi-file Parquet dataset, our workaround is as follows. Based on the Table.to_pandas() documentation, the following code may be relevant:

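import pyarrow.parquet as pq

# Read every file under the directory into a single Arrow table.
dft = pq.read_table('path/to/data_parquet/', use_pandas_metadata=True)
# Ask pyarrow to return 'column_2' as a pandas categorical.
df = dft.to_pandas(categories=['column_2'])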

The use_pandas_metadata option works for the datetime64[ns] dtype.


