AWS Glue: How to add a column with the source filename in the output?
You can do it with spark in your etl job:
var df = glueContext.getCatalogSource(
database = database,
tableName = table,
transformationContext = s"source-$database.$table"
).getDynamicFrame()
.toDF()
.withColumn("input_file_name", input_file_name())
glueContext.getSinkWithFormat(
connectionType = "s3",
options = JsonOptions(Map(
"path" -> args("DST_S3_PATH")
)),
transformationContext = "",
format = "parquet"
).writeDynamicFrame(DynamicFrame(df, glueContext))
Remember it works with getCatalogSource() API only and not with create_dynamic_frame_from_options()
With an AWS Glue Python auto-generated script, I've added the following lines:
from pyspark.sql.functions import input_file_name
## Add the input file name column
datasource1 = datasource0.toDF().withColumn("input_file_name", input_file_name())
## Convert DataFrame back to DynamicFrame
datasource2 = datasource0.fromDF(datasource1, glueContext, "datasource2")
Then, in the ApplyMapping
or datasink
portions of the code, you reference datasource2
.