1. Prerequisites
- Check the Java version first.
- Then install Miniconda3 and create a virtual environment based on Python 3.6 (see the sketch after the Java check below).
java -version
# The command above might show something like below
# >> openjdk version "1.8.0_262"
# >> OpenJDK Runtime Environment (build 1.8.0_262-b10)
# >> OpenJDK 64-Bit Server VM (build 25.262-b10, mixed mode)
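A minimal sketch of the Miniconda step; the environment name pyspark_env below is an arbitrary example, not something this guide prescribes.
# create and activate a Python 3.6 environment (assumes Miniconda3 is already installed)
conda create -n pyspark_env python=3.6
conda activate pyspark_env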
2. Download Spark
# Download spark
wget https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
# extract the archive
tar -xvzf spark-3.0.1-bin-hadoop2.7.tgz
# move to the home directory and rename
mv spark-3.0.1-bin-hadoop2.7 ~/spark
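Note: Apache keeps only recent releases on downloads.apache.org, so if the link above stops working, the same file can be found under https://archive.apache.org/dist/spark/.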
3. Install pyspark
pip install pyspark
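To keep the Python library in step with the Spark distribution downloaded above, the version can be pinned; a hedged example:
# optional: match the library to the Spark 3.0.1 distribution from step 2
pip install pyspark==3.0.1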
4. Change the execution path for pyspark
# add the lines below to .bashrc (replace /your_home_directory with your actual home path)
export SPARK_HOME="/your_home_directory/spark/"
export PATH="$SPARK_HOME/bin:$PATH"
# reload the configuration
source ~/.bashrc
5. Test
$ pyspark
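If the paths are set correctly, this opens the PySpark shell, which pre-defines a SparkSession named spark. A minimal sanity check inside the shell could look like this:
>>> spark.range(10).count()
# >> 10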
6. PySpark in Jupyter
# add the lines below to .bashrc
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
$ pyspark
# Now it will print a URL for the Jupyter notebook, e.g.:
# >> [I 17:34:45.744 NotebookApp] Serving notebooks from local directory: /home/jupyter
# >> [I 17:34:45.744 NotebookApp] Jupyter Notebook 6.1.5 is running at:
# >> [I 17:34:45.744 NotebookApp] http://host:port/
# >> [I 17:34:45.744 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
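Inside a notebook launched this way, the pyspark driver usually pre-defines spark; the cell below is a self-contained sketch that works either way (the app name jupyter-test and the sample rows are arbitrary examples):
from pyspark.sql import SparkSession

# getOrCreate() returns the session the driver already built, or creates a new one
spark = SparkSession.builder.appName("jupyter-test").getOrCreate()

# tiny DataFrame to confirm the kernel can talk to Spark
df = spark.createDataFrame([(1, "spark"), (2, "jupyter")], ["id", "tool"])
df.show()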