Pandas: Reading the first n rows from a parquet file?
After exploring around and getting in touch with the pandas dev team, the upshot is that pandas does not support the nrows or skiprows arguments when reading a parquet file.
The reason is that pandas uses the pyarrow or fastparquet engines to process parquet files, and pyarrow has no support for reading a file partially or for skipping rows (not sure about fastparquet). Below is the link to the issue on the pandas GitHub where this was discussed.
https://github.com/pandas-dev/pandas/issues/24511
The accepted answer is out of date. It is now possible to read only the first few rows of a parquet file into pandas, though it is a bit messy and backend dependent.
To read using PyArrow as the backend, do the following:
from pyarrow.parquet import ParquetFile
import pyarrow as pa

# Open the file and pull only the first batch (at most 10 rows)
pf = ParquetFile('file_name.pq')
first_ten_rows = next(pf.iter_batches(batch_size=10))
# Wrap the single batch in a Table and convert it to a DataFrame
df = pa.Table.from_batches([first_ten_rows]).to_pandas()
Change batch_size=10 to match however many rows you want to read in.