AWS Glue: How to handle nested JSON with varying schemas
The procedure I found useful for flattening nested JSON (a full sketch follows this list):

1. ApplyMapping for the first level as datasource0.
2. Explode struct or array objects to get rid of the element level: df1 = datasource0.toDF().select(id, col1, col2, ..., explode(coln).alias(coln)), where explode requires from pyspark.sql.functions import explode.
3. Select the JSON objects that you would like to keep intact: intact_json = df1.select(id, itct1, itct2, ..., itctm).
4. Transform df1 back to a DynamicFrame and Relationalize it, dropping the intact columns with dataframe.drop_fields(itct1, itct2, ..., itctm).
5. Join the relationalized table with the intact table on the 'id' column.
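As a rough sketch of those steps in a Glue job, something like the following should work. The database, table, column names (id, col1, nested_array, keep1) and the S3 staging/bucket paths are placeholders, and it assumes glueContext is already set up in the script:

from pyspark.sql.functions import explode, col
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import Relationalize, Join

# 1. Read the first level of the source (placeholder database/table)
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="my_db", table_name="my_table")

# 2. Explode the array column to get rid of the element level
df1 = datasource0.toDF().select(
    "id", "col1", explode(col("nested_array")).alias("nested_array"))

# 3. Keep the JSON objects that should stay intact
intact_json = df1.select("id", "keep1")

# 4. Back to a DynamicFrame, drop the intact columns, then Relationalize
dyf1 = DynamicFrame.fromDF(df1, glueContext, "dyf1").drop_fields(["keep1"])
relationalized = Relationalize.apply(
    frame=dyf1, staging_path="s3://my-bucket/tmp/", name="root",
    transformation_ctx="relationalized")

# 5. Join the relationalized root table back to the intact columns on 'id'
intact_dyf = DynamicFrame.fromDF(intact_json, glueContext, "intact_dyf")
joined = Join.apply(relationalized.select("root"), intact_dyf, "id", "id")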
I'm not sure you can do this with a table definition, but you can accomplish it in an ETL job by using a mapping function to cast the top-level values as JSON strings. Documentation: [link]
import json
from awsglue.transforms import Map

# Your mapping function: serialize each top-level value to a JSON string
def flatten(rec):
    for key in rec:
        rec[key] = json.dumps(rec[key])
    return rec

old_df = glueContext.create_dynamic_frame.from_options(
    's3',
    {"paths": ['s3://...']},
    "json")

# Apply mapping function f to all DynamicRecords in the DynamicFrame
new_df = Map.apply(frame=old_df, f=flatten)
From here you have the option of exporting to S3 (perhaps in Parquet or some other columnar format to optimize for querying) or, as I understand it, writing directly into Redshift, although I haven't tried that.
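For the S3 export, a minimal sketch using the new_df from above (the output path is a placeholder):

# Write the mapped DynamicFrame to S3 as Parquet (placeholder bucket/prefix)
glueContext.write_dynamic_frame.from_options(
    frame=new_df,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet")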