
Pyspark Jupyter Kernels

A Jupyter kernel is a program that runs and introspects the user's code. IPython is probably the most popular kernel for Jupyter. It can be run independently of Jupyter, providing a powerful interactive Python shell. As a Jupyter kernel, however, it provides interactive Python development for Jupyter notebooks, together with Jupyter's interactive features.

Local and cluster Pyspark interactive environments can be provisioned via IPython. This can be done independently of Jupyter, using IPython profiles, or more conveniently through Jupyter, by using the IPython kernel. In this post the focus is on the latter, which we call Pyspark Jupyter Kernels (short: Pyspark Kernels). Readers interested in configuring IPython profiles for Pyspark can use this post as a starting point.

In this post we show how to implement and share Pyspark Kernels for Jupyter. Our main contribution is a generic Pyspark Kernel template, together with an automation script that creates custom Pyspark Kernels from that template. Having Pyspark configured to run directly via a Jupyter kernel integrates seamlessly with Jupyterhub deployments. Both artifacts presented here are open sourced in our GitHub repository, together with how-to-use instructions. This post provides the foundational information and background for that work.

Jupyter Vs. IPython

IPython was released to provide interactive development for Python; this is where the name comes from (IPython = Interactive Python). As the development of the project continued and the repository grew, several components were recognized as not being specific to Python; in fact, several other programming languages have been integrated with the "IPython" notebook, the most popular being Julia and R. So why the name "IPython" if other programming languages are involved?

In 2014, Jupyter started as a spin-off project from IPython. The language-agnostic parts of the IPython project, such as the notebook format, the message protocol, and the notebook web application, were moved into the Jupyter project. In the Jupyter and IPython communities, this is called "The Big Split". IPython now has only two roles to fulfill: being the Python backend to the Jupyter Notebook, also known as the kernel, and being an interactive Python shell. The Jupyter project, on the other hand, supports several other kernels, including Julia, R, Scala, C++, and Go.

More details on the evolution of and some of the key differences between IPython and Jupyter can be found here.

Literature Review

Pyspark is built on top of Spark's Java API. Data is processed in Python and cached/shuffled in the JVM. In the Python driver program, the SparkContext uses Py4J to launch a JVM and create a JavaSparkContext. Py4J is only used on the driver, for local communication between the Python and Java SparkContext objects. The main approach to using Pyspark interactively with Jupyter is to integrate the Pyspark shell with the IPython kernel. There are two main directions in the literature that partially provide this integration. In the following sections we discuss these approaches and highlight their drawbacks.

Method 1 – Configure Pyspark Driver

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

We do this by exporting the environment variables above and then running the Pyspark shell. Pyspark will start via Jupyter, by default using the IPython kernel. The main benefit of this method is that one can provide all Spark configuration and settings while running the Pyspark shell, as one would normally do. However, this method does not provide reproducible and/or shared environments, and it integrates poorly with enterprise usage of Jupyter (think of a group of data scientists in a big organization using Jupyterhub). Additionally, while the user is free to configure the executors' virtual environment by exporting PYSPARK_PYTHON, it is a bit unnatural to reason about which virtual environment the driver is using, since we have exported PYSPARK_DRIVER_PYTHON=jupyter (we leave it to the reader to figure out a way to control the driver's virtual environment).
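As a minimal sketch of this method (the paths, the conda environment name and the Spark settings are assumptions for illustration), a typical Method 1 session could look like this:

# assumed example: a conda environment named "spark_env" provides the executors' Python
export PYSPARK_PYTHON=/opt/anaconda3/envs/spark_env/bin/python
# the driver is started through Jupyter, which by default uses the IPython kernel
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
# spark configuration is passed on the command line, as one would normally do
pyspark --master yarn --executor-cores 2 --executor-memory 4g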

Method 2 – Configure a Jupyter Kernel

While there are several blog posts describing how to configure Pyspark Jupyter kernels (through IPython), they lack a process or an approach for automating the creation of those kernels. Additionally, they either do not discuss how to customize your Spark environment, or rely solely on the default Spark configuration and environment files. From the literature available for this method, this and this post were a great inspiration for the work presented here.

IPython Kernels

A Jupyter kernel identifies itself to Jupyter by creating a directory, the name of which is used as an identifier for the kernel. A kernel directory may be created in a number of locations: system-wide, per environment (ex. shared with other users), or local (i.e. user specific). These have default paths, and additional paths can be provisioned by the user by exporting the JUPYTER_PATH environment variable. More details can be found here (note: after reading the referenced section, the kernel directory shall be created in a path similar to ${JUPYTER_DATA_DIR}/kernels/${KERNEL_NAME}; all Jupyter data directories can be viewed using the command jupyter --paths).
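For example (the kernel name below is an assumption, and the user-level data directory shown is the usual Linux default; the actual paths depend on your platform and installation):

# list the data (and therefore kernel) directories that Jupyter will search
jupyter --paths
# create a user-local kernel directory, assuming the default Linux user data dir
mkdir -p ~/.local/share/jupyter/kernels/pyspark_local_example
# the kernel definition then goes into a kernel.json file inside that directory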

The actual kernel definition is a kernel.json file placed in the kernel directory. For IPython, this shall include starting a Python interpreter (in your desired virtual environment) using the ipykernel module, and optionally providing Python-related environment variables, ex. PYTHONPATH, PYTHONSTARTUP. The following is a minimal IPython kernel definition.

{
  "argv": ["python3", "-m", "ipykernel", "-f", "{connection_file}"],
  "display_name": "Python 3",
  "language": "python"
}


The specification of the JSON file, and a more detailed overview, is referenced here.
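Such a definition is essentially what the ipykernel package generates itself. As a quick sketch (the kernel name here is an assumption), an equivalent user-local kernel can be registered with:

# register the current Python environment as a user-local Jupyter kernel
python3 -m ipykernel install --user --name python3_example --display-name "Python 3"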


Figure 1: Visual representation of the IPython kernel. The Jupyter Notebook Server, the kernel definition file, and the Python virtual environment exist on the same machine (scoped by the dotted lines). The browser can be open on the same machine as the notebook server (ex. if Jupyter is running on your desktop) or on a different one (if Jupyter is running on a remote node).

Pyspark Kernels

Assuming that creating virtual environments is not the issue here, in a typical scenario one would like to access and switch between their Pyspark virtual environments from their "single-user" Jupyter Notebook Server. Additionally, one would like to be able to share these virtual environments with other developer groups in the same organization. The "single-user" Jupyter Notebook Server can be started directly by the user via Jupyter, by logging into Jupyterhub, etc.


Figure 2: Example of a Jupyter Notebook Server where the logged-in user has access to 3 Pyspark Kernels: PysparkLocal1 and PysparkLocal2, representing Pyspark kernels that are only visible to that user account, and PysparkShared1, representing a kernel that is shared with other users.

Similar to the IPython kernel, a Pyspark Kernel deals with a Python interpreter (and a virtual environment) running on a single machine. However, the Pyspark Kernel is expected to interact with the Pyspark driver, which requires an additional layer of configuration for providing Spark configuration and environment variables (spark master, executor cores, executor RAM, PYSPARK_PYTHON, etc.). It is essential that these configuration parameters are encapsulated in the kernel's definition, the kernel.json file.

A Pyspark Kernel, in this context, defines a specific Spark configuration tied to a specific Python virtual environment. The Spark configuration can include any Spark configuration parameter or environment variable. The Python virtual environment is, ex., an Anaconda virtual environment that exists locally (on the node where the kernel will be registered and where the Jupyter process is expected to run). Additionally, a Pyspark Kernel shall reference all the Pyspark artifacts and dependencies that do not typically exist in newly created virtual environments (ex. the pyspark Python library and py4j).

In summary, to implement a complete Pyspark Kernel (using IPython) the following is necessary:

  • using the IPython kernel to start the desired spark-driver python interpreter (and virtual environment)
  • adding spark and its dependency py4j to the python path (typically found in ${SPARK_HOME}/python/ and ${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip respectively). Note that the py4j version here is assumed to be 0.10.4.
  • exporting spark environment variables (ex. providing details on the locations of the driver's and executors' virtual environments)
  • providing spark configuration parameters (ex. --master, --archives, --executor-cores, etc.), which end up in PYSPARK_SUBMIT_ARGS (see the sketch after this list)
  • executing the spark initialization shell script before starting that interpreter (typically found in ${SPARK_HOME}/python/pyspark/shell.py). This is responsible for initializing and provisioning the spark context.
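As an illustrative sketch (the values are assumptions), the spark configuration parameters are passed through PYSPARK_SUBMIT_ARGS, which typically has to end with pyspark-shell when the interpreter is started this way:

# assumed example values; adjust to your cluster
export PYSPARK_SUBMIT_ARGS="--master yarn --executor-cores 2 --executor-memory 4g pyspark-shell"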

pyspark_kernel.template

{
  "display_name": ${KERNEL_NAME},
  "language": "python",
  "argv": [ ${PYSPARK_DRIVER_PYTHON}, "-m", "ipykernel", "-f", "{connection_file}" ],
  "env": {
    "SPARK_HOME": ${SPARK_HOME},
    # sets the search path for importing python modules
    "PYTHONPATH": ${SPARK_HOME}/python/:${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip,
    "PYSPARK_DRIVER_PYTHON": ${PYSPARK_DRIVER_PYTHON},
    "PYSPARK_PYTHON": ${PYSPARK_PYTHON},
    "PYSPARK_SUBMIT_ARGS": ${PYSPARK_SUBMIT_ARGS},
    # specifies the location of a python script that will be run by python
    # before starting the python interactive mode (interpreter)
    "PYTHONSTARTUP": ${SPARK_HOME}/python/pyspark/shell.py
  }
}


The template above is a valid, complete Pyspark Kernel template written in HOCON format. By exporting the self-descriptive environment variables above and running a command like cat pyspark_kernel.template | pyhocon -f json >> ${JUPYTER_DATA_DIR}/kernels/${KERNEL_NAME}/kernel.json, a new Pyspark Kernel is created and is directly accessible to the user from their single-user Jupyter Notebook Server. The Pyspark Kernel includes the virtual environment information and the Spark configuration details.
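Put together, a minimal manual generation could look like the following sketch (all values are assumptions for illustration; the template resolves the ${...} substitutions from the exported environment, as described above):

# assumed example values
export KERNEL_NAME=pyspark_local_example
export SPARK_HOME=/opt/spark
export PYSPARK_DRIVER_PYTHON=/opt/anaconda3/envs/spark_env/bin/python
export PYSPARK_PYTHON=/opt/anaconda3/envs/spark_env/bin/python
export PYSPARK_SUBMIT_ARGS="--master local[*] pyspark-shell"
export JUPYTER_DATA_DIR=~/.local/share/jupyter

# create the kernel directory and render the template into kernel.json
mkdir -p ${JUPYTER_DATA_DIR}/kernels/${KERNEL_NAME}
cat pyspark_kernel.template | pyhocon -f json >> ${JUPYTER_DATA_DIR}/kernels/${KERNEL_NAME}/kernel.json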

Note that:

  • the location and access rights of ${JUPYTER_DATA_DIR} typically control whether the created Pyspark Kernel is local to the user or shared with other users.
  • the command provided above uses the pyhocon Python library to parse the pyspark_kernel.template file.

Figure 3: The difference from IPython kernels is that each Pyspark Kernel includes not only information about the Python interpreter and virtual environment to be used, but also additional configuration parameters and environment variables that specify how the SparkContext will be initialized. (Note: local vs. shared Pyspark Kernels, which are decided based on access rights and permissions, are not to be confused with local vs. cluster Pyspark deployments, which are specified by the --master Spark configuration parameter.)

Automating Generation of Pyspark Kernels

So far we have defined a view of what a Pyspark Kernel is and provided a generic Pyspark Kernel template in HOCON format. The final step is to automate the generation of Pyspark Kernels from the provisioned template. We achieve this through a bash script that converts user-specific inputs (ex. virtual environment path, Spark configuration parameters, etc.) into their corresponding environment variables in the template and generates the kernel.json file. We use pyhocon for the generation of the kernel specification file. Pyhocon is a Python library and, for convenience, it is installed in the driver's virtual environment. Additionally, the Jupyter library is installed in the driver's virtual environment in order to use the IPython kernel.

The pyspark_kernel.sh script accepts the following inputs:

  • -t | --kernels_template_path: path to pyspark_kernel.template
  • -d | --kernels_dir_path: root location for the kernels dir (for Jupyter)
  • -k | --kernel_name: the kernel name
  • -e | --venv_dir_path: path to the virtual environment to be used by both the spark driver and executors
  • --spark_home: spark home
  • --spark_master: currently supporting local[*] and yarn
  • --spark.*: (optional) any spark configuration parameter that can be provided to spark via PYSPARK_SUBMIT_ARGS

Note: pyspark_kernel.template and pyspark_kernel.sh are kept very generic. However, in typical scenarios --kernels_template_path, --kernels_dir_path, --spark_home, and --spark_master can be hardcoded based on the cluster being utilized. In that case, the only inputs required from the user are --kernel_name, --venv_dir_path, and --spark.*.
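For illustration, an invocation might look like the following sketch. The paths, kernel name, and configuration values are assumptions, and the exact syntax of the --spark.* options may differ; please refer to the how-to-use instructions in the repository for the authoritative usage:

# assumed example: create a user-local yarn kernel for the "spark_env" conda environment
./pyspark_kernel.sh \
  -t ./pyspark_kernel.template \
  -d ~/.local/share/jupyter/kernels \
  -k pyspark_yarn_example \
  -e /opt/anaconda3/envs/spark_env \
  --spark_home /opt/spark \
  --spark_master yarn \
  --spark.executor.cores 2 \
  --spark.executor.memory 4g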

Conclusion

Confusion about running and configuring Pyspark together with Jupyter has been around for too long. While there are several posts out there outlining alternative ways of linking Pyspark to Jupyter, the provided methods do not include reproducible/shared environments, integrate poorly with enterprise usage of Jupyter (like Jupyterhub), and/or lack custom configuration for the provisioned Spark environment. The situation is further complicated by the confusion in the literature between Jupyter and IPython, where many people use the two interchangeably.

In this blogpost we provided background knowledge on Jupyter and IPython, together with the current state of integrating Pyspark with Jupyter. We provided a view of what a Pyspark Jupyter Kernel is and built a generic template and automation script on top of that definition. These artifacts are available as open source in our GitHub repository, together with how-to-use instructions.
