How to write a parquet file from a pandas DataFrame to S3 in Python
First, ensure that you have pyarrow or fastparquet installed alongside pandas.
Then install boto3 and the AWS CLI. Use the AWS CLI to set up the config and credentials files, located in the .aws folder.
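Once those files exist, boto3 picks the credentials up automatically. As a quick sanity check, here is a minimal sketch; the profile name "my-profile" is hypothetical, and you can omit profile_name to use the default profile:

import boto3

# boto3 reads ~/.aws/credentials and ~/.aws/config on its own;
# "my-profile" is a hypothetical profile created via `aws configure --profile my-profile`
session = boto3.Session(profile_name="my-profile")
s3 = session.client("s3")
print(s3.list_buckets()["Buckets"])  # lists your buckets if the credentials are valid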
Here is a simple script that uses pyarrow and boto3 to create a temporary parquet file and then send it to AWS S3.
Sample code:
import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def main():
    data = {0: {"data1": "value1"}}
    df = pd.DataFrame.from_dict(data, orient="index")
    write_pandas_parquet_to_s3(
        df, "bucket", "folder/test/file.parquet", ".tmp/file.parquet")

def write_pandas_parquet_to_s3(df, bucketName, keyName, fileName):
    # convert the DataFrame to an Arrow table and write it to a local parquet file
    table = pa.Table.from_pandas(df)
    pq.write_table(table, fileName)
    # upload to S3 (parquet is binary, so open the file in "rb" mode)
    s3 = boto3.client("s3")
    with open(fileName, "rb") as f:
        object_data = f.read()
    s3.put_object(Body=object_data, Bucket=bucketName, Key=keyName)
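If you would rather not read the temporary file into memory yourself, boto3 can also stream it with upload_file; a minimal alternative for the upload step, reusing the same placeholder paths as above:

import boto3

s3 = boto3.client("s3")
# upload_file streams the local file to S3 instead of loading it fully into memory
s3.upload_file(".tmp/file.parquet", "bucket", "folder/test/file.parquet")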
For Python 3.6+, AWS has a library called aws-data-wrangler (installed as awswrangler) that helps with the integration between pandas, S3, and Parquet.
To install it, do:
pip install awswrangler
If you want to write your pandas DataFrame as a parquet file to S3, do:

import awswrangler as wr

wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/key/my-file.parquet"
)
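To verify the write, awswrangler can also read the file back into a DataFrame; a short sketch using the same placeholder path:

import awswrangler as wr

# read the parquet file back from S3 into a pandas DataFrame
df2 = wr.s3.read_parquet(path="s3://my-bucket/key/my-file.parquet")
print(df2.head())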
The function below gets the parquet output in a buffer and then writes buffer.getvalue() to S3, without any need to save the parquet file locally.
Also, since you're creating the S3 client yourself, you can create credentials using AWS keys that can be stored locally, in an Airflow connection, or in AWS Secrets Manager.
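For example, a minimal sketch of building such a client from keys held as plain strings; the values are placeholders, and in practice they would come from an Airflow connection or AWS Secrets Manager:

import boto3

s3_client = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY_ID",          # placeholder
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",  # placeholder
)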
from io import BytesIO, StringIO

def dataframe_to_s3(s3_client, input_dataframe, bucket_name, filepath, format):
    if format == 'parquet':
        out_buffer = BytesIO()
        input_dataframe.to_parquet(out_buffer, index=False)
    elif format == 'csv':
        out_buffer = StringIO()
        input_dataframe.to_csv(out_buffer, index=False)
    s3_client.put_object(Bucket=bucket_name, Key=filepath, Body=out_buffer.getvalue())
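For example, a call might look like this; the DataFrame, bucket name, and key below are placeholders:

import boto3
import pandas as pd

df = pd.DataFrame({"data1": ["value1"]})  # small example frame
s3_client = boto3.client("s3")            # uses the default credential chain
dataframe_to_s3(s3_client, df, "my-bucket", "folder/my-file.parquet", "parquet")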
s3_client is nothing but a boto3 client object. Hope this helps!
Courtesy: https://stackoverflow.com/a/40615630/12036254
For your reference, the following code works:
s3_url = 's3://bucket/folder/bucket.parquet.gzip'
df.to_parquet(s3_url, compression='gzip')
In order to use to_parquet, you need pyarrow or fastparquet to be installed. Also, make sure you have the correct information in your config and credentials files, located in the .aws folder.
Edit: Additionally, s3fs is needed. See https://stackoverflow.com/a/54006942/1862909
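If you do not want to rely on the .aws files, newer pandas versions (1.2+) also accept a storage_options dict that is passed through to s3fs; a hedged sketch with placeholder credentials:

import pandas as pd

df = pd.DataFrame({"data1": ["value1"]})  # small example frame
opts = {
    "key": "YOUR_ACCESS_KEY_ID",         # placeholder
    "secret": "YOUR_SECRET_ACCESS_KEY",  # placeholder
}
s3_url = 's3://bucket/folder/bucket.parquet.gzip'
df.to_parquet(s3_url, compression='gzip', storage_options=opts)

# reading it back works the same way
df2 = pd.read_parquet(s3_url, storage_options=opts)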