What is the fastest way to save a large pandas DataFrame to S3?
You can try using s3fs with pandas compression to upload to S3. Going through StringIO or BytesIO first is memory hogging, since it keeps a full copy of the file in RAM.
import s3fs
import pandas as pd

s3 = s3fs.S3FileSystem(anon=False)
df = pd.read_csv("some_large_file")

# open the object in binary mode so the gzipped bytes are written as-is
# (compression with a file handle needs pandas >= 1.2)
with s3.open('s3://bucket/file.csv.gz', 'wb') as f:
    df.to_csv(f, compression='gzip')
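Note that recent pandas can also write straight to an S3 path itself (it uses s3fs under the hood), so a shorter variant of the same idea, assuming s3fs is installed and your AWS credentials are already configured:

import pandas as pd

df = pd.read_csv("some_large_file")

# pandas infers gzip compression from the .gz suffix and streams via s3fs
df.to_csv('s3://bucket/file.csv.gz')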
It really depends on the content, but that is not really a boto3 question. Try first to dump your DataFrame locally and see what is fastest and what size you get (a rough benchmark sketch follows the list of suggestions below).
Here are some suggestions that we have found to be fast, for cases between a few MB and over 2GB (although, for more than 2GB, you really want parquet, possibly split into a parquet dataset):
- Lots of mixed text/numerical data (SQL-oriented content): use df.to_parquet(file).
- Mostly numerical data (e.g. if df.dtypes indicates a happy numpy array of a single type, not object): you can try df.to_hdf(file, 'key').
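As a rough sketch of that local benchmark (the synthetic DataFrame and file names are just placeholders; parquet needs pyarrow and HDF5 needs tables):

import os
import time

import pandas as pd

# synthetic stand-in; use your real DataFrame instead
df = pd.DataFrame({'a': range(1_000_000), 'b': 'some text'})

# time and measure each local dump
for name, write in [
    ('parquet', lambda path: df.to_parquet(path)),
    ('h5', lambda path: df.to_hdf(path, key='df')),
    ('csv.gz', lambda path: df.to_csv(path, compression='gzip')),
]:
    path = f'benchmark.{name}'
    start = time.perf_counter()
    write(path)
    print(f'{name}: {time.perf_counter() - start:.2f}s, {os.path.getsize(path) / 1e6:.1f} MB')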
One bit of advice: try to split your df into shards that are meaningful to you (e.g. by time for time series). Especially if you have a lot of updates to a single shard (e.g. the last one in a time series), it will make your download/upload much faster.
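For example, a minimal sketch of monthly sharding for a time series (synthetic data and local file names, just to show the idea; in practice each shard would go to its own S3 key):

import pandas as pd

# synthetic time series stand-in; use your real DataFrame instead
df = pd.DataFrame(
    {'value': range(365)},
    index=pd.date_range('2023-01-01', periods=365, freq='D'),
)

# one parquet file per month: only the shard that changed needs re-uploading
for period, shard in df.groupby(df.index.to_period('M')):
    shard.to_parquet(f'shard-{period}.parquet')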
What we have found is that HDF5 files are bulkier (uncompressed), but they save/load fantastically fast from/into memory. Parquet files are snappy-compressed by default, so they tend to be smaller (depending on the entropy of your data, of course; penalty for you if you save totally random numbers).
For the boto3 client, both multipart_chunksize and multipart_threshold are 8MB by default, which is often a fine choice. You can check via:
from boto3.s3.transfer import TransferConfig

tc = TransferConfig()
print(f'chunksize: {tc.multipart_chunksize}, threshold: {tc.multipart_threshold}')
Also, the default is to use 10 threads for each upload (which does nothing unless the size of your object is larger than the threshold above).
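If you do want to tune those knobs, a minimal sketch (the part size, thread count, bucket and file names below are example values, not recommendations):

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

# example values: 16MB parts, 20 parallel threads
config = TransferConfig(
    multipart_threshold=16 * 1024 * 1024,
    multipart_chunksize=16 * 1024 * 1024,
    max_concurrency=20,
)

s3.upload_file('file.csv.gz', 'bucket', 'file.csv.gz', Config=config)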
Another question is how to upload many files efficiently; that is not handled by any setting in TransferConfig. But I digress, the original question is about a single object.
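For what it's worth, a common workaround for the many-files case is to fan the per-file uploads out over a thread pool yourself (boto3 clients are generally thread-safe). A minimal sketch, with placeholder bucket and file names:

import concurrent.futures

import boto3

s3 = boto3.client('s3')
files = ['part-0001.csv.gz', 'part-0002.csv.gz']  # placeholders

# submit one upload per file and surface any errors
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(s3.upload_file, name, 'bucket', name) for name in files]
    for future in futures:
        future.result()  # re-raises if an upload failed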
Use multi-part uploads to make the transfer to S3 faster. Compression makes the file smaller, so that will help too.
import boto3
from io import BytesIO

s3 = boto3.client('s3')

# gzip-compress the CSV into an in-memory buffer
# (compression with a file-like object needs pandas >= 1.2)
csv_buffer = BytesIO()
df.to_csv(csv_buffer, compression='gzip')
csv_buffer.seek(0)  # rewind before uploading

# multipart upload (bucket and key are placeholders);
# use boto3.s3.transfer.TransferConfig if you need to tune part size or other settings
s3.upload_fileobj(csv_buffer, bucket, key)
The docs for s3.upload_fileobj are here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_fileobj