Can Google Cloud Dataprep monitor a GCS path for new files?
A further update on this. Since I asked my question, a new release of Dataprep on Jan 23 2018 added the ability to re-run Dataflow jobs independently of Dataprep.
When you execute a Dataprep job it will generate a Dataflow template that you can use to trigger jobs manually in the future and it allows certain parameters to be passed in.
Steps to trigger a job on new files (please note this is Beta, so Google may change the exact process):
- Create your flow and run the relevant flow/recipe. Iterate manually until the recipe is how you want it. Once you are happy with it, run the job again (it should be a job that appends data rather than replaces it, since you will likely want to append new content). It's probably a good idea to uncheck "Profile results" (new feature) to reduce overhead, since this will be a repeatable job.
- Once the job completes, go to the Job details page and click the Export Results button; there you should see a link to the Dataflow template. Copy the text. Note that the Dataflow template path will only be available for jobs executed after the Jan 23 2018 release, since it was a new feature.
- You can then see how to trigger a Dataflow job by going to Dataflow and selecting CREATE JOB FROM TEMPLATE, selecting Custom template and pasting in your template path. There you will see the parameters you can supply, such as your GCS input path.
- Write a Google Cloud Function that is triggered by a GCS write and, using the details of the event, executes the template with your file path supplied as the input parameter described in the previous step.
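As a rough sketch of the last step, a Python background Cloud Function could launch the template through the Dataflow REST API (`projects.templates.launch`) using the Google API Python client. The project ID, template path, and the parameter name `inputLocation` below are placeholders I made up for illustration; use the parameter names that the CREATE JOB FROM TEMPLATE form shows for your own template. The helper `build_launch_body` is also just an illustrative name.

```python
import re

# Assumptions (not from the steps above): replace all three with your own values.
PROJECT_ID = "my-project"                                   # hypothetical project ID
TEMPLATE_PATH = "gs://my-bucket/path/to/dataprep/template"  # hypothetical; paste the path copied from Export Results
INPUT_PARAM = "inputLocation"                               # hypothetical; use the name shown in the template form


def build_launch_body(bucket, name, job_prefix="dataprep-rerun"):
    """Build the request body for projects.templates.launch from a GCS event."""
    # Keep the job name to lowercase letters, digits and hyphens.
    suffix = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return {
        "jobName": f"{job_prefix}-{suffix}"[:60].rstrip("-"),
        "parameters": {INPUT_PARAM: f"gs://{bucket}/{name}"},
    }


def on_gcs_write(event, context):
    """Background Cloud Function entry point for a GCS object finalize event."""
    # Imported lazily so the helper above can be tried without the client library.
    from googleapiclient.discovery import build

    body = build_launch_body(event["bucket"], event["name"])
    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    request = dataflow.projects().templates().launch(
        projectId=PROJECT_ID, gcsPath=TEMPLATE_PATH, body=body
    )
    return request.execute()
```

Here `on_gcs_write` would be deployed with a GCS trigger on the bucket you are watching, and `google-api-python-client` would need to be listed in the function's `requirements.txt`. Calling `build_launch_body("my-bucket", "incoming/Sales 2018.csv")` locally is a quick way to check the request body before deploying.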
You can add a GCS path as a dataset by clicking the + icon to the left of the folder when importing the dataset (see screenshot). When you set up a scheduled job for a flow that uses this dataset, all files in that directory (including new files) will be picked up on each scheduled job run.