AWS EMR Spark Python Logging
I'm using emr-5.30.1 running in YARN client mode and got this working using the Python logging library.
I didn't like the solutions that call Spark's private JVM methods. Besides relying on a private API, they caused my application logs to appear in the Spark logs (which are already quite verbose) and forced me to use Spark's logging format.
Sample code using logging:
import logging

# Configure the root logger once, before any log calls are made.
logging.basicConfig(
    format="""%(asctime)s,%(msecs)d %(levelname)-8s[%(filename)s:%(funcName)s:%(lineno)d] %(message)s""",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
)

if __name__ == '__main__':
    logging.info('test')
    ...
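One thing worth knowing: `logging.basicConfig` attaches a handler that writes to `sys.stderr` unless a `stream` or `filename` is passed. If you would rather pin the destination explicitly, here is a minimal sketch using a dedicated logger with a stdout handler. The function name `build_stdout_logger` and the logger name `my_app` are made up for illustration; the format string is the one from the sample above.

```python
import logging
import sys

LOG_FORMAT = (
    "%(asctime)s,%(msecs)d %(levelname)-8s"
    "[%(filename)s:%(funcName)s:%(lineno)d] %(message)s"
)


def build_stdout_logger(name="my_app", level=logging.INFO):
    """Return a logger that writes to stdout (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Only attach a handler the first time, so repeated calls
    # do not produce duplicate log lines.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter(fmt=LOG_FORMAT, datefmt="%Y-%m-%d %H:%M:%S")
        )
        logger.addHandler(handler)
    # Keep records away from the root logger's handlers.
    logger.propagate = False
    return logger


if __name__ == "__main__":
    build_stdout_logger().info("test")
```

Using a named logger instead of the root logger also keeps third-party library messages out of your application log.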
When the cluster is created, I specify LogUri='s3://mybucket/emr/' via the console / CLI / boto.
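For the boto route, a rough sketch of what the request might look like with boto3's `run_job_flow`. Only `LogUri` and the release label come from this answer; the cluster name, instance types, instance count and the default role names are placeholders I picked for illustration, so adjust them to your account.

```python
def build_cluster_request(log_uri="s3://mybucket/emr/"):
    """Build keyword arguments for the EMR run_job_flow call."""
    return {
        "Name": "spark-logging-demo",      # placeholder name
        "LogUri": log_uri,                 # where EMR persists logs
        "ReleaseLabel": "emr-5.30.1",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",   # placeholder
            "SlaveInstanceType": "m5.xlarge",    # placeholder
            "InstanceCount": 3,                  # placeholder
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",    # assumes default roles exist
        "ServiceRole": "EMR_DefaultRole",
    }


def create_cluster(request):
    """Submit the request; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    emr = boto3.client("emr")
    response = emr.run_job_flow(**request)
    return response["JobFlowId"]
```

Keeping the request builder separate from the API call makes it easy to inspect or unit-test the configuration without touching AWS.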
Log output appears in stdout.gz of the relevant step, which can be found using either of the options below:
- In the EMR Console choose your Cluster. On the "Summary" tab, click the tiny folder icon next to "Log URI". Within the popup, navigate to steps, choose your step id, and open stdout.gz.
- In S3 navigate to the logs directly. They are located at emr/j-<cluster-id>/steps/s-<step-id>/stdout.gz in mybucket.
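If you want to pull that file from a script rather than clicking through the console, something along these lines should work. It is a sketch that assumes the key layout described above; the helper names are made up, and `fetch_step_stdout` needs boto3 plus read access to the bucket.

```python
import gzip


def step_stdout_key(cluster_id, step_id, prefix="emr/"):
    """Build the S3 key of a step's stdout.gz.

    cluster_id and step_id are the full ids, including the
    leading "j-" and "s-".
    """
    return f"{prefix}{cluster_id}/steps/{step_id}/stdout.gz"


def decode_gzip_log(raw_bytes):
    """Decompress a gzipped log file and return it as text."""
    return gzip.decompress(raw_bytes).decode("utf-8", errors="replace")


def fetch_step_stdout(bucket, cluster_id, step_id, prefix="emr/"):
    """Download and decode the step's stdout.gz from S3."""
    import boto3  # imported lazily; only needed for the download

    s3 = boto3.client("s3")
    key = step_stdout_key(cluster_id, step_id, prefix)
    obj = s3.get_object(Bucket=bucket, Key=key)
    return decode_gzip_log(obj["Body"].read())
```

Note that EMR copies logs to S3 periodically, so the file may not be there immediately after the step finishes.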
I've found that EMR's logging for particular steps almost never winds up in the controller or stderr logs that get pulled alongside the step in the AWS console.
Usually I find what I want in the job's container logs (and usually it's in stdout).
These are typically at a path like s3://mybucket/logs/emr/spark/j-XXXXXX/containers/application_XXXXXXXXX/container_XXXXXXX/... . You might need to poke around within the various application_... and container_... directories within containers.
That last container directory should have a stdout.log and stderr.log.
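To save some of that poking around, you could list the keys under the containers prefix (for example with boto3's `list_objects_v2` paginator) and group the stdout/stderr files by application and container. A minimal sketch of the grouping step, assuming the `containers/application_.../container_.../` layout described above; the function name is made up:

```python
import re
from collections import defaultdict

# Matches .../containers/<application>/<container>/stdout* or stderr*
_CONTAINER_LOG = re.compile(
    r"containers/(application_[^/]+)/(container_[^/]+)/(stdout|stderr)[^/]*$"
)


def group_container_logs(keys):
    """Group S3 keys of container logs by (application, container)."""
    grouped = defaultdict(list)
    for key in keys:
        match = _CONTAINER_LOG.search(key)
        if match:
            application, container, _stream = match.groups()
            grouped[(application, container)].append(key)
    return dict(grouped)
```

Keys that are not container stdout/stderr files (step logs, node logs, and so on) are simply skipped.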
For what it's worth: let j-XXX be the ID of the EMR cluster and assume it is configured to use logs_bucket for persisting logs on S3. If you want to find the logs emitted by your code, do the following:
- In the AWS console, find the step you want to review.
- Go to its stderr and search for application_. Take note of the full name you find; it should be something like application_15489xx175355_0yy5.
- Go to s3://logs_bucket/j-XXX/containers and find the folder application_15489xx175355_0yy5.
- In this folder, you will find at least one folder named application_15489xx175355_0yy5_ww_vvvv. In these folders you will find files named stderr.gz, which contain the logs emitted by your code.
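The first three steps can be scripted roughly as follows. This is only a sketch: it assumes you already have the step's stderr as text, that application ids look like `application_<digits>_<digits>`, and that the bucket layout matches the one described above. Both helper names are hypothetical.

```python
import re

# YARN application ids as they appear in the step's stderr.
APPLICATION_ID = re.compile(r"application_\d+_\d+")


def find_application_ids(stderr_text):
    """Return the distinct application ids in order of first appearance."""
    seen = []
    for app_id in APPLICATION_ID.findall(stderr_text):
        if app_id not in seen:
            seen.append(app_id)
    return seen


def containers_prefix(logs_bucket, cluster_id, application_id):
    """Build the S3 location holding that application's container logs."""
    return f"s3://{logs_bucket}/{cluster_id}/containers/{application_id}/"
```

From there you can list the folders under the returned prefix and open the stderr.gz files inside them.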