AWS Glue issue with double quote and commas
Look like you also need to add escapeChar
. AWS Athena docs shows this example:
CREATE EXTERNAL TABLE myopencsvtable (
col1 string,
col2 string,
col3 string,
col4 string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://location/of/csv/';
I do this to solve:
1 - Create a Crawler that don't overwrite the target table properties, I used boto3 for this but it can be created in AWS console to, Do this (change de xxx-var):
import boto3
client = boto3.client('glue')
response = client.create_crawler(
Name='xxx-Crawler-Name',
Role='xxx-Put-here-your-rol',
DatabaseName='xxx-databaseName',
Description='xxx-Crawler description if u need it',
Targets={
'S3Targets': [
{
'Path': 's3://xxx-Path-to-s3/',
'Exclusions': [
]
},
]
},
SchemaChangePolicy={
'UpdateBehavior': 'LOG',
'DeleteBehavior': 'LOG'
},
Configuration='{ \
"Version": 1.0, \
"CrawlerOutput": { \
"Partitions": {"AddOrUpdateBehavior": "InheritFromTable" \
}, \
"Tables": {"AddOrUpdateBehavior": "MergeNewColumns" } \
} \
}'
)
# run the crawler
response = client.start_crawler(
Name='xxx-Crawler-Name'
)
2 - Edit the serialization lib, I do this in AWS Console like say this post (https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#schema-csv-quotes) just change this:
3 - Run Crawler again. Run the crawler as always do:
4 - That's it, your 2nd run should not change any data in the table, it's just for testing that it's works ¯\_(ツ)_/¯.