How to install PySpark locally

1. Prerequisites

  • Check the Java version first (Spark 3.0 runs on Java 8 or 11)

  • Then install Miniconda3 and create a virtual environment based on Python 3.6 (a sketch follows the version check below)

    java -version
    # The output should look something like the following
    >> openjdk version "1.8.0_262"
    >> OpenJDK Runtime Environment (build 1.8.0_262-b10)
    >> OpenJDK 64-Bit Server VM (build 25.262-b10, mixed mode)
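
  The environment itself can be created once Miniconda is installed; a minimal sketch (the name pyspark_env is just an example):

    conda create -n pyspark_env python=3.6
    conda activate pyspark_env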
    

2. Download Spark

# Download Spark 3.0.1, pre-built for Hadoop 2.7
wget https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz

# Extract the archive
tar -xvzf spark-3.0.1-bin-hadoop2.7.tgz

# Move it to the home directory and rename it
mv spark-3.0.1-bin-hadoop2.7 ~/spark

3. Install PySpark

pip install pyspark
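
If you want the pip package to match the Spark release downloaded in step 2, you can pin the version (3.0.1 here, matching the tarball above):

pip install pyspark==3.0.1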

4. Set SPARK_HOME and add it to your PATH

export SPARK_HOME="/your_home_directory/spark/"
export PATH="$SPARK_HOME/bin:$PATH"
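
These exports only apply to the current shell session. If you add them to ~/.bashrc (as step 6 does for the Jupyter variables), reload it so they take effect:

source ~/.bashrc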

5. Test the installation

$ pyspark
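
The pyspark command should open an interactive shell with a ready-made SparkSession available as spark. As a further check, you can run one of the example jobs that ships with the Spark distribution (assuming SPARK_HOME is set as in step 4); it should print a line like "Pi is roughly 3.14...":

spark-submit $SPARK_HOME/examples/src/main/python/pi.py 10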

6. PySpark in Jupyter

# Add the lines below to ~/.bashrc (and reload it), then run pyspark again
export PYSPARK_DRIVER_PYTHON=jupyter 
export PYSPARK_DRIVER_PYTHON_OPTS='notebook' 
$ pyspark 

# >> pyspark now starts a Jupyter Notebook server and prints its URL, for example:
# >> [I 17:34:45.744 NotebookApp] Serving notebooks from local directory: /home/jupyter
# >> [I 17:34:45.744 NotebookApp] Jupyter Notebook 6.1.5 is running at:
# >> [I 17:34:45.744 NotebookApp] http://host:port/
# >> [I 17:34:45.744 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
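
With these settings the notebook kernel usually picks up the PySpark startup script, so spark and sc should already be defined in a new notebook; running spark.range(10).count() in a cell is a quick sanity check. If they are not defined, you can create a session yourself with pyspark.sql.SparkSession.builder.getOrCreate().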