Load a huge dataset from BigQuery into Python/pandas/Dask

Instead of querying, you can export the table to Cloud Storage, download the files locally, and load them into a Dask or pandas dataframe:

Export + Download:

bq --location=US extract --destination_format=CSV --print_header=false 'dataset.tablename' 'gs://mystoragebucket/data-*.csv' && \
  gsutil -m cp 'gs://mystoragebucket/data-*.csv' /my/local/dir/

Load into Dask:

>>> import dask.dataframe as dd
>>> df = dd.read_csv("/my/local/dir/*.csv", header=None)  # the export above writes no header row
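
Because the export uses --print_header=false, the downloaded shards have no header row, so column names have to be supplied when reading. Here is a minimal sketch of that, assuming Dask is installed; the helper names list_shards and load_shards and the column names in the usage line are hypothetical placeholders, not something from the export itself:

```python
import glob
import os


def list_shards(directory, pattern="data-*.csv"):
    """Return the downloaded shard paths in a stable, sorted order."""
    return sorted(glob.glob(os.path.join(directory, pattern)))


def load_shards(directory, column_names):
    """Lazily read all headerless CSV shards into one Dask dataframe."""
    import dask.dataframe as dd  # imported here so list_shards works without Dask

    paths = list_shards(directory)
    if not paths:
        raise FileNotFoundError(f"no CSV shards found in {directory}")
    # header=None: the first row of each file is data, not column names.
    return dd.read_csv(paths, header=None, names=column_names)
```

You would call it as load_shards("/my/local/dir", ["id", "name", "value"]), replacing the column list with the real schema of your table, in the same order as the exported columns.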

Hope it helps.

First, you should profile your code to find out what is taking the time. Is it just waiting for BigQuery to process your query? Is it the download of the data? What is your bandwidth, and what fraction of it do you use? Is it the parsing of that data into memory?
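
One simple way to answer those questions is to time each stage separately. This is only a sketch using the standard library; the stage names and the timed/report helpers are made up for illustration, and a real profiler would give you more detail:

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage):
    """Record the wall-clock duration of one stage under its name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


def report(timings):
    """Return (stage, seconds, share of total) rows, slowest stage first."""
    total = sum(timings.values()) or 1.0
    return sorted(
        ((stage, seconds, seconds / total) for stage, seconds in timings.items()),
        key=lambda row: row[1],
        reverse=True,
    )
```

Wrap the query wait, the download, and the parsing each in their own "with timed(...)" block, then look at report(timings) to see which stage dominates before deciding what to optimize.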

Since SQLAlchemy can be made to support BigQuery (https://github.com/mxmzdlv/pybigquery), you could try dask.dataframe.read_sql_table to split the read into partitions and load/process them in parallel. If BigQuery limits the bandwidth of a single connection or of a single machine, you may get much better throughput by running this on a distributed cluster.
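
A rough sketch of what that could look like, assuming dask, sqlalchemy and pybigquery are all installed and that the table has a numeric or datetime column to partition on. The helper names and the project/dataset/table/column values are hypothetical, and I have not verified how well the BigQuery dialect behaves with the queries Dask generates:

```python
def bigquery_uri(project, dataset):
    """Build the SQLAlchemy connection URI used by the pybigquery dialect."""
    return f"bigquery://{project}/{dataset}"


def read_table_partitioned(project, dataset, table, index_col, npartitions):
    """Read one BigQuery table as a Dask dataframe split on index_col."""
    import dask.dataframe as dd  # needs dask, sqlalchemy and pybigquery installed

    # With npartitions given and no explicit divisions, Dask looks up the
    # min/max of index_col and issues one range query per partition.
    return dd.read_sql_table(
        table,
        bigquery_uri(project, dataset),
        index_col=index_col,
        npartitions=npartitions,
    )
```

For example, read_table_partitioned("my-project", "dataset", "tablename", "id", 32) would return a lazy dataframe with 32 partitions; nothing is downloaded until you compute on it.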

Experiment!
