How to submit Spark jobs to an EMR cluster from Airflow?
While it may not directly address your particular query, broadly, here are some ways you can trigger `spark-submit` on (remote) EMR via Airflow:

1. Use Apache Livy

   - This solution is actually independent of the remote server, i.e., EMR
   - Here's an example
   - The downside is that Livy is in its early stages and its API appears incomplete and wonky to me
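To make the Livy option concrete, here is a minimal sketch of assembling the JSON body for Livy's `POST /batches` endpoint (the spark-submit equivalent). The payload field names (`file`, `className`, `args`, `conf`) follow Livy's batch API; the jar location, class name, and EMR host shown are placeholders, not real values.

```python
import json

def build_livy_batch_payload(file, class_name=None, args=None, conf=None):
    """Assemble the request body for a Livy batch submission."""
    payload = {"file": file}
    if class_name:
        payload["className"] = class_name
    if args:
        payload["args"] = list(args)
    if conf:
        payload["conf"] = dict(conf)
    return payload

payload = build_livy_batch_payload(
    file="s3://my-bucket/jars/my-spark-job.jar",  # placeholder jar location
    class_name="com.example.MySparkJob",          # placeholder main class
    args=["--date", "2020-01-01"],
    conf={"spark.executor.memory": "4g"},
)

# You would then POST this to the Livy server on the EMR master, e.g.:
#   requests.post("http://<emr-master-dns>:8998/batches",
#                 data=json.dumps(payload),
#                 headers={"Content-Type": "application/json"})
body = json.dumps(payload)
```

An Airflow task can do this POST inside a `PythonOperator` (or a `SimpleHttpOperator`) and then poll Livy's `GET /batches/{id}/state` until the job finishes.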
2. Use the EmrSteps API

   - Dependent on the remote system: EMR
   - Robust, but since it is inherently async, you will also need an `EmrStepSensor` (alongside `EmrAddStepsOperator`)
   - On a single EMR cluster, you cannot have more than one step running simultaneously (although some hacky workarounds exist)
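As a sketch of the EmrSteps approach, the snippet below builds one EMR step definition in the shape expected by the EMR `AddJobFlowSteps` API (and by Airflow's `EmrAddStepsOperator` via its `steps` parameter), running `spark-submit` through `command-runner.jar`. The jar path and class name are placeholders; the helper function name is my own.

```python
def spark_submit_step(name, jar, main_class, args=()):
    """Build one EMR step that runs spark-submit via command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--class", main_class,
                jar,
                *args,
            ],
        },
    }

step = spark_submit_step(
    name="my-spark-job",                         # placeholder step name
    jar="s3://my-bucket/jars/my-spark-job.jar",  # placeholder jar location
    main_class="com.example.MySparkJob",         # placeholder main class
    args=["--date", "2020-01-01"],
)

# In a DAG you would pass [step] to EmrAddStepsOperator(steps=[step], ...),
# then chain an EmrStepSensor on the returned step id, since adding a step
# only enqueues it; it does not wait for completion.
```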
3. Use `SSHHook` / `SSHOperator`

   - Again, independent of the remote system
   - Comparatively easier to get started with
   - If your `spark-submit` command involves a lot of arguments, building that command (programmatically) can become cumbersome
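To illustrate that last point, here is a hypothetical helper (my own naming, not an Airflow API) for assembling a long `spark-submit` invocation safely before handing it to an `SSHOperator`; `shlex.quote` keeps paths and arguments with special characters intact on the remote shell.

```python
import shlex

def build_spark_submit(app_jar, main_class, conf=None, app_args=()):
    """Assemble a spark-submit command string for execution over SSH."""
    parts = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--class", main_class,
    ]
    # Each Spark config becomes its own --conf key=value pair.
    for key, value in (conf or {}).items():
        parts += ["--conf", f"{key}={value}"]
    parts.append(app_jar)
    parts += list(app_args)
    # Quote every token so spaces/special characters survive the remote shell.
    return " ".join(shlex.quote(p) for p in parts)

command = build_spark_submit(
    app_jar="s3://my-bucket/jars/my-spark-job.jar",  # placeholder
    main_class="com.example.MySparkJob",             # placeholder
    conf={"spark.executor.memory": "4g", "spark.executor.cores": "2"},
    app_args=["--date", "2020-01-01"],
)
# command can then be passed as SSHOperator(command=command, ...)
```

Even with a helper like this, every new flag or config key means touching the builder, which is the cumbersomeness the bullet above refers to.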
**EDIT-1**

There seems to be another straightforward way:

4. Specifying the remote `master` IP

   - Independent of the remote system
   - Needs modifying Global Configurations / Environment Variables
   - See @cricket_007's answer for details
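A rough sketch of what that approach looks like in practice, assuming you can copy the YARN client configs off the EMR master node and have Spark installed wherever Airflow runs (paths and hostnames below are placeholders):

```shell
# 1. Copy the Hadoop/YARN client configs from the EMR master node, e.g.:
#      scp hadoop@<emr-master-dns>:/etc/hadoop/conf/*.xml /opt/emr-conf/

# 2. Point Spark at those configs (globally, or in the Airflow worker's env):
export HADOOP_CONF_DIR=/opt/emr-conf
export YARN_CONF_DIR=/opt/emr-conf

# 3. spark-submit now resolves the remote ResourceManager from those configs,
#    so the master is simply "yarn" rather than a hard-coded IP.
#    Placeholder class and jar below:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MySparkJob \
  s3://my-bucket/jars/my-spark-job.jar
```

The Airflow side then reduces to a plain `BashOperator` (or local `spark-submit` call), since the cluster targeting lives entirely in the environment configuration.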
**Useful links**

- This one is from @Kaxil Naik himself: Is there a way to submit spark job on different server running master
- Spark job submission using Airflow by submitting batch POST method on Livy and tracking job
- Remote spark-submit to YARN running on EMR