Event-based trigger of an AWS Glue crawler after a file is uploaded into an S3 bucket?
As a quick start, here is a blow-by-blow account of how to create a Lambda in Python to do this. This is the first time I've created a Lambda so YMMV.
- To save time, select 'Create function' and then click 'Blueprints'. Select the sample called 's3-get-object-python' and click 'Configure'
- Fill in the Lambda name and create a new Role unless you already have one.
- The wizard will set up the S3 trigger at the same time
- Once you create it, you will need to find the Role that it created and add a new permission via a policy containing:

        "Action": "glue:StartCrawler",
        "Resource": "*"
- Change the code to something like:
    from __future__ import print_function
    import json
    import boto3

    print('Loading function')

    # Region and endpoint below are the ones I used (ap-southeast-2) - change to suit
    glue = boto3.client(service_name='glue', region_name='ap-southeast-2',
                        endpoint_url='https://glue.ap-southeast-2.amazonaws.com')

    def lambda_handler(event, context):
        # print("Received event: " + json.dumps(event, indent=2))
        try:
            # Kick off the crawler whenever a new object lands in the bucket
            glue.start_crawler(Name='my-glue-crawler')
        except Exception as e:
            print(e)
            print('Error starting crawler')
            raise e
Finally, if you chose to leave the trigger disabled while developing, click the S3 trigger in the designer panel and ensure it is enabled (you may need to save the Lambda after making this change).
That's it, but note that StartCrawler throws a CrawlerRunningException if the crawler is already running, so you will want to handle that if you have frequent uploads or long crawls. See: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-crawling.html#aws-glue-api-crawler-crawling-StartCrawler
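For example, one way to treat the already-running case as benign (a sketch; 'my-glue-crawler' is the placeholder name from above):

    try:
        glue.start_crawler(Name='my-glue-crawler')
    except glue.exceptions.CrawlerRunningException:
        # A crawl kicked off by an earlier upload is still in progress;
        # the new file will be picked up by that run or the next one
        print('Crawler already running - nothing to do')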
No, there is currently no direct way to invoke an AWS Glue crawler in response to an upload to an S3 bucket. S3 event notifications can only be sent to:
- SNS
- SQS
- Lambda
However, it would be trivial to write a small piece of Lambda code to programmatically invoke a Glue crawler using the relevant language SDK.
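For instance, in Python with boto3 it could be as little as the following sketch (the crawler name is a placeholder, and the event parsing assumes the standard S3 notification format):

    import boto3

    glue = boto3.client('glue')

    def lambda_handler(event, context):
        # Log which object triggered us (standard S3 notification event shape)
        record = event['Records'][0]['s3']
        print('New object: s3://{}/{}'.format(record['bucket']['name'],
                                              record['object']['key']))
        # Start the crawler; 'my-glue-crawler' is a placeholder name
        glue.start_crawler(Name='my-glue-crawler')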