Can PySpark work without Spark?
PySpark ships with its own Spark installation. If installed through pip3, you can find it with pip3 show pyspark; for me it is at ~/.local/lib/python3.8/site-packages/pyspark.
This is a standalone configuration, so it can't be used for managing clusters the way a full Spark installation can.
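As a quick illustration, here is a minimal sketch (the app name and data are just placeholders) showing that the pip-installed PySpark is enough to run a purely local session, with no separate Spark download or SPARK_HOME set:

from pyspark.sql import SparkSession

# A local session backed entirely by the pip-installed PySpark; "local[*]"
# runs on all local cores, with no cluster manager involved.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("pip-pyspark-check")   # arbitrary name
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()

spark.stop()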
As of v2.2, executing pip install pyspark will install Spark as well. If you're going to use PySpark, this is clearly the simplest way to get started.
On my system, Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars.
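If you want to confirm where pip put the bundled Spark files on your own machine, a small check along these lines works (the exact paths will differ per environment):

import os
import pyspark

# The pyspark package directory, e.g. .../site-packages/pyspark
pyspark_dir = os.path.dirname(pyspark.__file__)
# The Spark JARs bundled by the pip install live in its "jars" subfolder
jars_dir = os.path.join(pyspark_dir, "jars")

print("PySpark package:", pyspark_dir)
print("Bundled Spark JARs:", len(os.listdir(jars_dir)))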
PySpark installed by pip corresponds to a subfolder of the full Spark distribution; you can find most of the PySpark Python files in spark-3.0.0-bin-hadoop3.2/python/pyspark. So if you'd like to use the Java or Scala interfaces, or deploy a distributed system with Hadoop, you must download the full Spark distribution from the Apache Spark site and install it.
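If you do download a full distribution but still want to drive it from the pip-installed Python package, one common approach is to point SPARK_HOME at the download before creating a session. This is only a sketch: the path below is a hypothetical unpack location, and it assumes the downloaded distribution's version matches your pip-installed PySpark.

import os

# Hypothetical path: adjust to wherever you unpacked the full distribution.
os.environ["SPARK_HOME"] = "/opt/spark-3.0.0-bin-hadoop3.2"

# Import after SPARK_HOME is set so PySpark launches spark-submit from the
# full distribution instead of the copy bundled by pip.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("full-spark").getOrCreate()
print(spark.version)
spark.stop()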