PySpark + Anaconda + Jupyter (Windows)

It seems like just about every six months I need to install PySpark and the experience is never the same. Note that this isn't necessarily the fault of Spark itself. Instead, it's a combination of the many different situations under which Spark can be installed, lack of official documentation for each and every such situation, and me not writing down the steps I took to successfully install it. So today, I decided to write down the steps needed to install the most recent version of PySpark under the conditions in which I currently need it: inside an Anaconda environment on Windows 10.

Note that the page which best helped produce the following solution can be found here (Medium article). I later found a second page with similar instructions which can be found here (Towards Data Science article).

Steps to Install PySpark for Use with Jupyter


This solution assumes Anaconda is already installed, an environment named `test` has already been created, and Jupyter has already been installed to it.

1. Install Java


Make sure Java is installed.

It may be necessary to set the `JAVA_HOME` environment variable and add Java's `bin` directory to `PATH`.

If you cannot go into the system menu to edit these settings, they can be set temporarily from within Jupyter:
import os
# Point JAVA_HOME at the Java install and prepend its bin folder to PATH
os.environ["JAVA_HOME"] = "c:\\Program Files\\Java\\jre1.8.0_202"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "\\bin;" + os.environ["PATH"]

Replace the version name and number as necessary (e.g., jdk1.8.0_201).
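
To confirm the variables took effect (a quick optional check, not part of the original steps), you can print them back from the notebook:
import os
print(os.environ["JAVA_HOME"])                  # should show the Java install directory
print(os.path.isdir(os.environ["JAVA_HOME"]))   # True if that path actually exists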

2. Install Spark


We choose to install pyspark from the conda-forge channel. As an example, let's say I want to add it to my `test` environment. Then in the terminal I would enter the following:

`conda activate test`
`conda install -c conda-forge pyspark`
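
As an optional sanity check, you can confirm the package is importable from the environment's kernel before going further:
import pyspark
print(pyspark.__version__)   # prints the installed PySpark version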

Now set `SPARK_HOME`.

As in Step 1, if you cannot go into the system menu to add this variable, then it can be temporarily set from within Jupyter:
import os
os.environ["SPARK_HOME"] = "c:\\Users\\{user.name}\\Anaconda3\\envs\\{environment.name}\\Lib\\site-packages\\pyspark"

3. Setup winutils.exe


While I was able to get Spark running without winutils, I had to download it and create an environment variable for it in order to export a Spark DataFrame to a Parquet file.

Obtain a version of winutils from here (GitHub repo); I used hadoop-3.0.0.

I dropped winutils.exe into "c:\hadoop\bin\" and set the corresponding environment variable in my notebook as follows:
`os.environ["HADOOP_HOME"] = "c:\\hadoop\\"`

4. Use PySpark with Jupyter


It may be enough to initialize a Spark session with the following:
import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
# Configure a local Spark context and wrap it in a SparkSession for DataFrame work
conf = pyspark.SparkConf()
conf.setAppName('mySparkApp')
conf.setMaster('local')
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession(sc)
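
An equivalent way to create the same session is the builder API (not used above, shown here only as an alternative):
spark = SparkSession.builder.appName('mySparkApp').master('local').getOrCreate()
sc = spark.sparkContext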

Test that Spark is running by executing the following cell:
nums = sc.parallelize([1,2,3,4])
nums.count()
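
Since the session object was created as well, a DataFrame-based check along the same lines (just an illustrative example) is:
spark.range(5).count()   # should return 5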

If the installation doesn't work at this point, we may need to install and use the `findspark` module.

At the command line, run the following inside your environment:
`conda install -c conda-forge findspark`

Then, inside the notebook, after setting `SPARK_HOME` but before importing pyspark, run the following:
import findspark
findspark.init()
findspark.find()

Summary/Recap


At the end of the day, we might have run the following in the terminal:
`conda activate test`
`conda install -c conda-forge pyspark`
`conda install -c conda-forge findspark`

Not mentioned above, but an optional step here is to test Spark directly from the terminal.
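
For example (assuming the `pyspark` launcher installed with the conda package ends up on the activated environment's PATH), the following starts an interactive shell with `spark` and `sc` already defined:
`conda activate test`
`pyspark`
Once the shell comes up, a quick check such as `spark.range(5).count()` should return 5.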

We would then download `winutils.exe` and place it in `c:\hadoop\bin\`.

Then, opening up Jupyter, we may have something like the following in our notebook:
import os
os.environ["HADOOP_HOME"] = "c:\\hadoop\\"
os.environ["JAVA_HOME"] = "c:\\Program Files\\Java\\jre1.8.0_202"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "\\bin;" + os.environ["PATH"]
os.environ["SPARK_HOME"] = "c:\\Users\\{user.name}\\Anaconda3\\envs\\{environment.name}\\Lib\\site-packages\\pyspark"

import findspark
findspark.init()
findspark.find()

import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = pyspark.SparkConf()
conf.setAppName('mySparkApp')
conf.setMaster('local')
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession(sc)

nums = sc.parallelize([1,2,3,4])
nums.count()
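
One optional last step, not part of the walkthrough above: when you are finished, stopping the context shuts the local Spark JVM down cleanly:
sc.stop()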
