Fastest file format for read/write operations with Pandas and/or Numpy
Pandas recently added support for the Parquet format, using the pyarrow library as the backend
(written by Wes McKinney himself, with his usual obsession for performance).
You only need to install the pyarrow library and use the methods read_parquet and to_parquet.
Parquet is much faster to read and write for bigger datasets (above a few hundred megabytes or more), and it also keeps track of dtype metadata, so you won't lose data type information when writing and reading from disk. It can actually store some data types more efficiently than HDF5, such as strings and timestamps: HDF5 doesn't have a native data type for those, so it uses pickle to serialize them, which makes it slow for big datasets.
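A minimal sketch of the round-trip (assuming pyarrow is installed; the file name and columns are made up for illustration):

```python
import pandas as pd

# Hypothetical DataFrame with a string and a timestamp column
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "value": [1.0, 2.0, 3.0],
    "when": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
})

df.to_parquet("data.parquet")              # pyarrow is used as the engine
restored = pd.read_parquet("data.parquet")
print(restored.dtypes)                     # dtypes, including datetime64, survive the round-trip
```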
Parquet is also a columnar format, which makes it very easy to do two things:
Quickly filter out columns that you're not interested in. With CSV you have to read the whole file and only then throw away the columns you don't want. With Parquet you can read only the columns you're interested in.
Run queries that filter out rows and read only what you care about (see the sketch below).
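A rough sketch of both ideas (the file name, column names, and filter are assumptions; filters is passed through to the pyarrow engine):

```python
import pandas as pd

# Read only two columns; the rest of the file is never loaded.
subset = pd.read_parquet("data.parquet", columns=["name", "value"])

# Push a row filter down to the reader instead of loading everything.
recent = pd.read_parquet(
    "data.parquet",
    columns=["name", "value"],
    filters=[("value", ">", 1.0)],
)
```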
Another interesting recent development is the Feather file format, which is also developed by Wes McKinney. It's essentially just the uncompressed Arrow format written directly to disk, so it is potentially faster to write than the Parquet format. The disadvantage is files that are 2-3x larger.
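Writing and reading Feather looks much the same (a sketch, again with made-up names):

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b"], "value": [1.0, 2.0]})

df.to_feather("data.feather")              # also backed by pyarrow
restored = pd.read_feather("data.feather")
```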
Use HDF5. Beats writing flat files hands down. And you can query. Docs are here
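A minimal sketch of a queryable HDF5 store (assumes the PyTables package is installed; the names and the query are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"value": range(1000)})

# format='table' plus data_columns makes the columns queryable on disk
df.to_hdf("store.h5", key="df", format="table", data_columns=True)

# Only the matching rows are read back
hits = pd.read_hdf("store.h5", key="df", where="value > 990")
```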
Here's a perf comparison vs SQL. Updated to show SQL/HDF_fixed/HDF_table/CSV write and read perfs.
Docs now include a performance section:
See here