How to Convert Many CSV files to Parquet using AWS Glue
Please refer to EDIT for updated info.
S3 --> Athena
Why not you use CSV format directly with Athena?
https://docs.aws.amazon.com/athena/latest/ug/supported-format.html
CSV is one of the supported formats. Also to make it efficient, you can compress multiple CSV files for faster loading.
Supported compression,
https://docs.aws.amazon.com/athena/latest/ug/compression-formats.html
Hope it helps.
EDIT:
Why Parquet format is more helpful than CSV?
https://dzone.com/articles/how-to-be-a-hero-with-powerful-parquet-google-and
S3 --> Glue --> Athena
More details on CSV to Parquet conversion,
https://aws.amazon.com/blogs/big-data/build-a-data-lake-foundation-with-aws-glue-and-amazon-s3/
I had the exact same situation where I wanted to efficiently loop through the catalog tables catalogued by crawler which are pointing to csv files and then convert them to parquet. Unfortunately there is not much information available in the web yet. That's why I have written a blog in LinkedIn explaining how I have done it. Please have a read; specially point #5. Hope that helps. Please let me know your feedback.
Note: As per Antti's feedback, I am pasting the excerpt solution from my blog below:
- Iterating through catalog/database/tables
The Job Wizard comes with option to run predefined script on a data source. Problem is that the data source you can select is a single table from the catalog. It does not give you option to run the job on the whole database or a set of tables. You can modify the script later anyways but the way to iterate through the database tables in glue catalog is also very difficult to find. There are Catalog APIs but lacking suitable examples. The github example repo can be enriched with lot more scenarios to help developers.
After some mucking around, I came up with the script below which does the job. I have used boto3 client to loop through the table. I am pasting it here if it comes to someone’s help. I would also like to hear from you if you have a better suggestion
import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
client = boto3.client('glue', region_name='ap-southeast-2')
databaseName = 'tpc-ds-csv'
print '\ndatabaseName: ' + databaseName
Tables = client.get_tables(DatabaseName=databaseName)
tableList = Tables['TableList']
for table in tableList:
tableName = table['Name']
print '\n-- tableName: ' + tableName
datasource0 = glueContext.create_dynamic_frame.from_catalog(
database="tpc-ds-csv",
table_name=tableName,
transformation_ctx="datasource0"
)
datasink4 = glueContext.write_dynamic_frame.from_options(
frame=datasource0,
connection_type="s3",
connection_options={
"path": "s3://aws-glue-tpcds-parquet/"+ tableName + "/"
},
format="parquet",
transformation_ctx="datasink4"
)
job.commit()
I'm not a big fan of Glue, nor creating schemas from data
Here's how to do it in Athena, which is dramatically faster than Glue.
This is for the CSV files:
create table foo (
id int,
name string,
some date
)
row format delimited
fields terminated by ','
location 's3://mybucket/path/to/csvs/'
This is for the parquet files:
create table bar
with (
external_location = 's3://mybucket/path/to/parquet/',
format = 'PARQUET'
)
as select * from foo
You don't need to create that path for parquet, even if you use partitioning
you can convert either JSON or CSV files into parquet directly, without importing it to the catalog first.
This is for the JSON files - the below code would convert anything hosted at the rawFiles directory
import sys
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
## @params: [JOB_NAME] args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sparkContext = SparkContext()
glueContext = GlueContext(sparkContext)
spark = glueContext.spark_session
job = Job(glueContext) job.init(args['JOB_NAME'], args)
s3_json_path = 's3://rawFiles/'
s3_parquet_path = 's3://convertedFiles/'
output = spark.read.load(s3_json_path, format='json')
output.write.parquet(s3_parquet_path)
job.commit()