Adding custom jars to pyspark in jupyter notebook

I've managed to get it working from within the Jupyter notebook, which is running from the all-spark container.

I start a python3 notebook in JupyterHub and overwrite the PYSPARK_SUBMIT_ARGS environment variable as shown below. The Kafka consumer library was downloaded from the Maven repository and put in my home directory /home/jovyan:

import os
# Must be set before the SparkContext is created
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--jars /home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar pyspark-shell'
)

import pyspark
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext

sc = pyspark.SparkContext()
ssc = StreamingContext(sc, 1)  # 1-second batch interval

broker = "<my_broker_ip>"
directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test1"],
                        {"metadata.broker.list": broker})
directKafkaStream.pprint()
ssc.start()

Note: Don't forget the trailing pyspark-shell in the PYSPARK_SUBMIT_ARGS value!

Extension: If you want to include code from spark-packages, you can use the --packages flag instead. An example of how to do this in the all-spark-notebook can be found here
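
As a rough sketch of how --jars and --packages could be combined into one PYSPARK_SUBMIT_ARGS value: the helper name build_submit_args is hypothetical (it is not part of pyspark), and the Maven coordinate below is only an assumed example, so substitute the one that matches your Spark and Scala versions.

```python
import os


def build_submit_args(jars=(), packages=()):
    """Build a PYSPARK_SUBMIT_ARGS value (hypothetical helper, not part of pyspark)."""
    parts = []
    if jars:
        # --jars takes a comma-separated list of jar paths
        parts += ["--jars", ",".join(jars)]
    if packages:
        # --packages takes comma-separated Maven coordinates (group:artifact:version)
        parts += ["--packages", ",".join(packages)]
    # The value has to end with pyspark-shell
    parts.append("pyspark-shell")
    return " ".join(parts)


# Assumed example coordinate; adjust it to your Spark/Scala versions.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(
    packages=["org.apache.spark:spark-streaming-kafka_2.10:1.6.1"]
)
```

As with the --jars variant, this has to run before the SparkContext is created, since the variable is read when the JVM is launched.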

Indeed, there is a way to link it dynamically via the SparkConf object when you create the SparkSession, as explained in this answer:

from pyspark.sql import SparkSession

# spark.jars takes a comma-separated list of jar paths
spark = SparkSession \
    .builder \
    .appName("My App") \
    .config("spark.jars", "/path/to/jar.jar,/path/to/another/jar.jar") \
    .getOrCreate()
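
A mistyped jar path tends to surface only later as a missing-class error, so it can help to check the paths before building the session. The following is a minimal sketch under that assumption; jars_conf and make_session are hypothetical helper names, not pyspark APIs.

```python
import os


def jars_conf(paths):
    """Return a comma-separated value for spark.jars (hypothetical helper)."""
    missing = [p for p in paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError("jar(s) not found: " + ", ".join(missing))
    return ",".join(paths)


def make_session(app_name, jar_paths):
    """Create a SparkSession with the given local jars attached."""
    # Imported here so jars_conf can be used without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars", jars_conf(jar_paths))
        .getOrCreate()
    )
```

Usage would look like spark = make_session("My App", ["/path/to/jar.jar"]). Note that getOrCreate() returns an already running session if there is one, in which case the spark.jars setting is likely not applied; restarting the notebook kernel first avoids that.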
