How to read a compressed (gz) CSV file into a dask Dataframe?
This is actually a long-standing limitation of dask. One workaround is to load the files with dask.delayed instead:
import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed
filenames = ...
dfs = [delayed(pd.read_csv)(fn) for fn in filenames]
df = dd.from_delayed(dfs) # df is a dask dataframe
Pandas' current documentation for read_csv says:
compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’
Since 'infer' is the default, that would explain why it is working with pandas.
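A quick check confirms this (the file name and contents here are made up for the demo): pandas reads a .gz file with no extra arguments, and the result matches passing compression='gzip' explicitly.

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a small gzipped CSV by hand.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "Data.gz")
with gzip.open(path, "wt") as f:
    f.write("a,b\n1,2\n3,4\n")

# compression defaults to 'infer', so the .gz suffix is enough.
inferred = pd.read_csv(path)
explicit = pd.read_csv(path, compression="gzip")

assert inferred.equals(explicit)
```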
Dask's documentation on the compression argument:
String like ‘gzip’ or ‘xz’. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically
That suggests dask should likewise infer the compression, at least for gz files. That it doesn't (and still does not as of 0.15.3) may be a bug. However, it does work if you pass compression='gzip' explicitly:
import dask.dataframe as dd
df = dd.read_csv("Data.gz", compression='gzip')