Importing pyspark in the Python shell

It turns out that the pyspark binary loads Python and automatically sets the correct library paths. Check out $SPARK_HOME/bin/pyspark:

export SPARK_HOME=/some/path/to/apache-spark
# Add the PySpark classes to the Python path:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

I added these lines to my .bashrc file and the modules are now found correctly!
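
As a quick check (a minimal sketch, assuming SPARK_HOME is also exported and a local master is fine for testing), the import should now work from a plain Python shell:

from pyspark import SparkConf, SparkContext

# Build a small local context just to confirm the imports and the Java gateway work
conf = SparkConf().setMaster("local[*]").setAppName("ImportCheck")
sc = SparkContext(conf=conf)
print(sc.version)  # prints the Spark version if the paths are set up correctly
sc.stop()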


If it prints an error such as:

ImportError: No module named py4j.java_gateway

Please add $SPARK_HOME/python/build to PYTHONPATH:

export SPARK_HOME=/Users/pzhang/apps/spark-1.1.0-bin-hadoop2.4
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
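
If you prefer not to edit .bashrc, the same paths can also be added from inside Python for the current session only (a sketch, assuming SPARK_HOME is exported in your environment):

import os
import sys

spark_home = os.environ["SPARK_HOME"]

# Make PySpark and the bundled py4j sources importable for this session
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python", "build"))

import py4j.java_gateway  # should no longer raise ImportError
from pyspark import SparkContext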

Assuming one of the following:

  • Spark is downloaded on your system and you have an environment variable SPARK_HOME pointing to it
  • You have run pip install pyspark

Here is a simple method (if you don't care about how it works):

Use findspark

  1. Install findspark from a terminal, then go to your Python shell and initialize it

    pip install findspark    # run this in a terminal, not in the Python shell
    
    import findspark         # then, in the Python shell
    findspark.init()
    
  2. Import the necessary modules

    from pyspark import SparkContext
    from pyspark import SparkConf
    
  3. Done! (A complete session putting these steps together is sketched below.)
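
Putting these steps together, a complete session might look like this (a sketch; the local[*] master and the app name are just placeholders):

import findspark
findspark.init()  # locates Spark via SPARK_HOME (a path can also be passed explicitly)

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("FindsparkExample")
sc = SparkContext(conf=conf)

rdd = sc.parallelize(range(10))
print(rdd.sum())  # 45
sc.stop()

Under the hood, findspark essentially performs the sys.path additions shown earlier, which is why no manual PYTHONPATH changes are needed.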