Profiling a Scala Spark application
I've recently written an article and a script that wrap spark-submit and generate a flame graph after executing a Spark application.
Here's the article: https://www.linkedin.com/pulse/profiling-spark-applications-one-click-michael-spector
Here's the script: https://raw.githubusercontent.com/spektom/spark-flamegraph/master/spark-submit-flamegraph
Just use it instead of the regular spark-submit.
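For example (the application class, jar, and master below are just placeholders), you invoke it with exactly the arguments you would normally hand to spark-submit:

```
# Fetch the wrapper once and make it executable
curl -o spark-submit-flamegraph \
  https://raw.githubusercontent.com/spektom/spark-flamegraph/master/spark-submit-flamegraph
chmod +x spark-submit-flamegraph

# Run it like spark-submit; a flame graph is produced after the application finishes
./spark-submit-flamegraph \
  --class com.example.MyApp \
  --master yarn \
  my-app.jar
```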
I would recommend using the UI that Spark provides directly. It exposes a lot of information and metrics about timing, stages, network usage, and so on.
You can check more about it here: https://spark.apache.org/docs/latest/monitoring.html
Also, the new Spark version (1.4.0) includes a nice visualizer for understanding the steps and stages of your Spark jobs.
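If you also want to explore a job after it finishes (the live UI on port 4040 goes away with the driver), you can turn on event logging and point the history server at the log directory. A minimal sketch; the log directory, class, and jar are placeholders:

```
# Write event logs so the history server can rebuild the UI after the job ends
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs:///spark-logs \
  --class com.example.MyApp \
  my-app.jar

# Set spark.history.fs.logDirectory to the same path (e.g. in spark-defaults.conf),
# then start the history server and browse it on port 18080
./sbin/start-history-server.sh
```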
As you said, profiling a distributed process is trickier than profiling a single JVM process, but there are ways to achieve this.
You can use sampling as a thread-profiling method: add a Java agent to the executors that captures stack traces, then aggregate those stack traces to see which methods your application spends the most time in.
For example, you can use Etsy's statsd-jvm-profiler Java agent and configure it to send the stack traces to InfluxDB, then aggregate them using flame graphs.
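Roughly, the wiring looks like the sketch below. The InfluxDB host, credentials, jar version, and application names are assumptions on my part; the agent options (server, port, reporter, etc.) are as described in the statsd-jvm-profiler documentation.

```
# Ship the profiler agent jar to every executor and attach it as a -javaagent.
# Host, credentials, jar version and app name below are placeholders.
spark-submit \
  --files statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar \
  --conf "spark.executor.extraJavaOptions=-javaagent:statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar=server=influxdb.example.com,port=8086,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=MyNamespace.MySparkApp,tagMapping=namespace.application" \
  --class com.example.MyApp \
  my-app.jar
```

After the run you can dump the stack samples out of InfluxDB and fold them into an SVG with Brendan Gregg's FlameGraph scripts; the post below walks through the details.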
For more information, check out my post on profiling Spark applications: https://www.paypal-engineering.com/2016/09/08/spark-in-flames-profiling-spark-applications-using-flame-graphs/