Pandas DataFrame with categorical columns from a Parquet file using read_parquet?
This is fixed in Arrow 0.15. The following code now keeps the columns as categories (and performance is significantly faster):
import pandas
df = pandas.DataFrame({'foo': list('aabbcc'),
                       'bar': list('xxxyyy')}).astype('category')
df.to_parquet('my_file.parquet')
df = pandas.read_parquet('my_file.parquet')
df.dtypes
We are having a similar problem. When working with a multi-file Parquet dataset, our workaround is as follows. Based on the Table.to_pandas() documentation, the following code may be relevant:
import pyarrow.parquet as pq
dft = pq.read_table('path/to/data_parquet/', use_pandas_metadata=True)
df = dft.to_pandas(categories=['column_2'])
The use_pandas_metadata option works for the datetime64[ns] dtype.
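If neither approach is available (for example on an Arrow release older than 0.15), one possible fallback is to re-cast the affected columns after loading. The following is only a minimal sketch: the helper name restore_categories is hypothetical and not part of pandas or pyarrow, and the in-memory frame stands in for whatever read_parquet or to_pandas returned.

```python
import pandas as pd


def restore_categories(df, columns):
    """Return a copy of df with the named columns cast to the category dtype.

    Hypothetical helper for illustration; not part of pandas or pyarrow.
    """
    out = df.copy()
    for col in columns:
        # Skip names that are absent so the helper tolerates partial schemas.
        if col in out.columns:
            out[col] = out[col].astype('category')
    return out


# Stand-in for a frame whose categorical columns came back as plain strings.
loaded = pd.DataFrame({'foo': list('aabbcc'), 'bar': list('xxxyyy')})
restored = restore_categories(loaded, ['foo', 'bar'])
print(restored.dtypes)
```

Note that this rebuilds the categories from the values present in the loaded data, so categories that had no rows in the file will not be recovered, and the cast costs an extra pass over each column.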