Having your Spark Notebook inside the same cluster as the executors can reduce network errors and improve uptime. Since these network issues can result in job failure, this is an important consideration. This post assumes that you’ve already set up the foundation JupyterHub inside of Kubernetes deployment; the Dask-distributed notebook blog post covers that if you haven’t.

I like to think of this as washing my dog (Timbit) is a lot easier inside of the bath-tun than trying to wash him outside. Although it can take a bit of work to get him inside the tub [timbit-tub-img].

Timbit in the bath tub

If you’re interested my YouTube playlist of Get Spark Working with Notebook inside my Kubernetes (K8s/K3s) ARM cluster shows the journey I went on to get this working. A lot of my blog posts come out of my Open Source Live Streams (which even include Timbit sometimes).

To get a Spark notebook working inside of the cluster, we need to set up a few different things. The first step, similar to dask-kubernetes, is building a container with Jupyter and Spark installed. We also need to make a container of Spark for the executors. In addition to the containers, we need to set up permissions on the cluster and ensure that the executors that your Spark driver will launch have a way to talk to the driver in the notebook.

Note

It may seem like there are extra steps here compared to dask-kubernetes. Dask-kubernetes automates some service creation, which allows for communication between the scheduler, executors, and the notebook.

Building the Containers

We need two containers, one with Jupyter and Spark installed together and another with just Spark. Since we’re working in Python, there are some extra Python libraries we want to install as well (PyArrow, pandas, etc.) If you’ve got a specific version of a library that your project depends on, you’ll want to add it to both the Jupyter Spark driver container and the executor containers.

To start with we’ll download Apache Spark and decompress it, as shown in Download Spark, so that we can copy the desired parts inside our containers.

Example 1. Download Spark
if [ ! -f spark-3.0.1-bin-hadoop3.2.tgz ]; then
  axel https://ftp.wayne.edu/apache/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
fi
if [ ! -d spark-3.0.1-bin-hadoop3.2 ]; then
  tar -xvf spark-3.0.1-bin-hadoop3.2.tgz
fi

Now that we have Spark downloaded we can start customizing our Dockerfiles.

Building the Jupyter Spark Container

The easiest way to build a Jupyter Spark container is to install Spark on top of the base Jupyter container. If you’re running on ARM, you’ll need to first cross-build the base Jupyter container (see my instructions in the previous post).

In my case I’ve custom built the single-user sample Docker container from zero-to-jupyterhub-k8s to holdenk/jupyter-hub-magicsingleuser-sample:0.10.2 as I needed ARM support. If you don’t need to cross-build your custom container, you can use the pre-built container at jupyterhub/k8s-singleuser-sample as the basis for yours.

Since Spark needs Java to run, I decided to look at the jdk11 slim dockerfile to see how to install Java in a dockerfile well. If you’re an object-oriented person, you might be wishing we had multiple-inheritence with Dockerfiles, but that doesn’t work. In addition to the JDK11 dockerfile, I looked at Spark’s own Dockerfiles (includign PySpark) and the resulting Juptyer Spark Container specification is shown in Dockerfile to add Spark on top of the Jupyter Notebook container..

Example 2. Dockerfile to add Spark on top of the Jupyter Notebook container.
FROM  holdenk/jupyter-hub-magicsingleuser-sample:0.10.2

# Switch to root to install stuff
USER root

# Use multiple cores to compile the C code :)
ENV MAKEFLAGS -j 4


RUN set -eux; \
    apt-get update && \
    apt-get install -yq graphviz git build-essential cmake telnet && \
    ln -s /lib /lib64 && \
    apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps ca-certificates p11-kit wget bzip2 git mercurial subversion && \
    mkdir -p /opt/spark && \
    mkdir -p /opt/spark/examples && \
    mkdir -p /opt/spark/work-dir && \
    touch /opt/spark/RELEASE && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
    conda install -c conda-forge --yes mamba && \
    mamba install --yes python==3.8.6 && \
    pip install --upgrade pip setuptools && \
    mamba install --yes numpy==1.19.2 pandas cytoolz numba lz4 scikit-build python-blosc=1.9.2 && \
    (mamba install --yes pyarrow ||  pip install -vvv pyarrow) && \
    apt-get clean && \
    rm -rf /var/cache/apt/* && \
    rm -rf /var/lib/apt/lists/* && \
    echo -e "$NB_USER\n$NB_USER" | passwd $NB_USER



# Based on https://github.com/docker-library/openjdk/blob/master/11/jdk/slim-buster/Dockerfile

# Default to UTF-8 file.encoding
ENV LANG C.UTF-8

ENV JAVA_HOME /usr/local/openjdk-11
ENV PATH $JAVA_HOME/bin:$PATH
RUN { echo '#/bin/sh'; echo 'echo "$JAVA_HOME"'; } > /usr/local/bin/docker-java-home && chmod +x /usr/local/bin/docker-java-home && [ "$JAVA_HOME" = "$(docker-java-home)" ]
ENV JAVA_VERSION 11.0.9.1

RUN set -eux; \
	\
	arch="$(dpkg --print-architecture)"; \
# this "case" statement is generated via "update.sh"
	case "$arch" in \
# arm64v8
		arm64 | aarch64) downloadUrl=https://github.com/AdoptOpenJDK/openjdk11-upstream-binaries/releases/download/jdk-11.0.9.1%2B1/OpenJDK11U-jdk_aarch64_linux_11.0.9.1_1.tar.gz ;; \
# amd64
		amd64 | i386:x86-64) downloadUrl=https://github.com/AdoptOpenJDK/openjdk11-upstream-binaries/releases/download/jdk-11.0.9.1%2B1/OpenJDK11U-jdk_x64_linux_11.0.9.1_1.tar.gz ;; \
# fallback
		*) echo >&2 "error: unsupported architecture: '$arch'"; exit 1 ;; \
	esac; \
	\
	savedAptMark="$(apt-mark showmanual)"; \
	apt-get update; \
	apt-get install -y --no-install-recommends \
		dirmngr \
		gnupg \
		wget \
	; \
	rm -rf /var/lib/apt/lists/*; \
	\
	wget -O openjdk.tgz.asc "$downloadUrl.sign"; \
	wget -O openjdk.tgz "$downloadUrl" --progress=dot:giga; \
	\
	export GNUPGHOME="$(mktemp -d)"; \
# TODO find a good link for users to verify this key is right (https://mail.openjdk.java.net/pipermail/jdk-updates-dev/2019-April/000951.html is one of the only mentions of it I can find); perhaps a note added to https://adoptopenjdk.net/upstream.html would make sense?
# no-self-sigs-only: https://salsa.debian.org/debian/gnupg2/commit/c93ca04a53569916308b369c8b218dad5ae8fe07
	gpg --batch --keyserver ha.pool.sks-keyservers.net --keyserver-options no-self-sigs-only --recv-keys CA5F11C6CE22644D42C6AC4492EF8D39DC13168F; \
# also verify that key was signed by Andrew Haley (the OpenJDK 8 and 11 Updates OpenJDK project lead)
# (https://github.com/docker-library/openjdk/pull/322#discussion_r286839190)
	gpg --batch --keyserver ha.pool.sks-keyservers.net --recv-keys EAC843EBD3EFDB98CC772FADA5CD6035332FA671; \
	gpg --batch --list-sigs --keyid-format 0xLONG CA5F11C6CE22644D42C6AC4492EF8D39DC13168F \
		| tee /dev/stderr \
		| grep '0xA5CD6035332FA671' \
		| grep 'Andrew Haley'; \
	gpg --batch --verify openjdk.tgz.asc openjdk.tgz; \
	gpgconf --kill all; \
	rm -rf "$GNUPGHOME"; \
	\
	mkdir -p "$JAVA_HOME"; \
	tar --extract \
		--file openjdk.tgz \
		--directory "$JAVA_HOME" \
		--strip-components 1 \
		--no-same-owner \
	; \
	rm openjdk.tgz*; \
	\
# TODO strip "demo" and "man" folders?
	\
	apt-mark auto '.*' > /dev/null; \
	[ -z "$savedAptMark" ] || apt-mark manual $savedAptMark > /dev/null; \
	apt-get purge -y --auto-remove -o APT::AutoRemove::RecommendsImportant=false; \
	\
# update "cacerts" bundle to use Debian's CA certificates (and make sure it stays up-to-date with changes to Debian's store)
# see https://github.com/docker-library/openjdk/issues/327
#     http://rabexc.org/posts/certificates-not-working-java#comment-4099504075
#     https://salsa.debian.org/java-team/ca-certificates-java/blob/3e51a84e9104823319abeb31f880580e46f45a98/debian/jks-keystore.hook.in
#     https://git.alpinelinux.org/aports/tree/community/java-cacerts/APKBUILD?id=761af65f38b4570093461e6546dcf6b179d2b624#n29
	{ \
		echo '#!/usr/bin/env bash'; \
		echo 'set -Eeuo pipefail'; \
		echo 'if ! [ -d "$JAVA_HOME" ]; then echo >&2 "error: missing JAVA_HOME environment variable"; exit 1; fi'; \
# 8-jdk uses "$JAVA_HOME/jre/lib/security/cacerts" and 8-jre and 11+ uses "$JAVA_HOME/lib/security/cacerts" directly (no "jre" directory)
		echo 'cacertsFile=; for f in "$JAVA_HOME/lib/security/cacerts" "$JAVA_HOME/jre/lib/security/cacerts"; do if [ -e "$f" ]; then cacertsFile="$f"; break; fi; done'; \
		echo 'if [ -z "$cacertsFile" ] || ! [ -f "$cacertsFile" ]; then echo >&2 "error: failed to find cacerts file in $JAVA_HOME"; exit 1; fi'; \
		echo 'trust extract --overwrite --format=java-cacerts --filter=ca-anchors --purpose=server-auth "$cacertsFile"'; \
	} > /etc/ca-certificates/update.d/docker-openjdk; \
	chmod +x /etc/ca-certificates/update.d/docker-openjdk; \
	/etc/ca-certificates/update.d/docker-openjdk; \
	\
# https://github.com/docker-library/openjdk/issues/331#issuecomment-498834472
	find "$JAVA_HOME/lib" -name '*.so' -exec dirname '{}' ';' | sort -u > /etc/ld.so.conf.d/docker-openjdk.conf; \
	ldconfig; \
	\
# basic smoke test
	fileEncoding="$(echo 'System.out.println(System.getProperty("file.encoding"))' | jshell -s -)"; [ "$fileEncoding" = 'UTF-8' ]; rm -rf ~/.java; \
	javac --version; \
	java --version


# Based on the Spark dockerfile

COPY jars /opt/spark/jars
COPY bin /opt/spark/bin
COPY sbin /opt/spark/sbin
COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/
# Wildcard so it covers decom.sh present (3.1+) and not present (pre-3.1)
COPY kubernetes/dockerfiles/spark/decom.sh* /opt/
COPY examples /opt/spark/examples
COPY kubernetes/tests /opt/spark/tests
COPY data /opt/spark/data
# We need to copy over the license file so we can pip install PySpark
COPY LICENSE /opt/spark/LICENSE
COPY licenses /opt/spark/licenses

ENV SPARK_HOME /opt/spark

# Note: don't change the workdir since then your Jupyter notebooks won't persist.
RUN chmod g+w /opt/spark/work-dir
# Wildcard so it covers decom.sh present (3.1+) and not present (pre-3.1)
RUN chmod a+x /opt/decom.sh* || echo "No decom script present, assuming pre-3.1"

# Copy pyspark with setup files and everything
COPY python ${SPARK_HOME}/python

# Add PySpark to PYTHON_PATH

RUN pip install -e ${SPARK_HOME}/python

# Switch to the user back to a non-root user that will actually do the running
USER $NB_USER

Since the Dockerfile copies parts of Spark in, remember to save it at the root of where you decompressed the Spark tarball.

If you’re not cross building, you can build this with a regular docker build, in my case since I’m targetting arm and x86 I did built it as shown in Build Spark notebook container.

Example 3. Build Spark notebook container
docker buildx build -t holdenk/spark-notebook:v3.0.1.2  --platform linux/arm64,linux/amd64 --push .
Note

An alternative would have been to take the JDK-11 containers as a starting point, and install Jupyter on top of it, but when I tried that I found it more complicated.

This gives us a container with both Spark and the base notebook layer together. For the executors, we don’t want to bother shipping Jupyter, so we’ll build a seperate container for the executors.

Building the Executor Container

Spark does not ship pre-built containers for its executors, so regardless of which arch you’re using, you will need to build the executor containers.

If you’re building multi-arch containers, you will need to update Spark’s docker image tool. You will need to change the buildx option to push the images by adding "--push" to the docker buildx commands in the script for ./bin/docker-image-tool.sh.

Spark’s Python container Dockerfile installs an older version of Python without any dependencies, so you will want to customize your Python container setup, as well. My Dockerfile is shown in Dockerfile customizing PySpark setup.

Example 4. Dockerfile customizing PySpark setup
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

ARG base_img

FROM $base_img
WORKDIR /

# Reset to root to run installation tasks
USER 0

ENV PATH /opt/conda/bin:$PATH

RUN mkdir ${SPARK_HOME}/python

RUN apt-get update --fix-missing && \
    apt-get install -yq graphviz git build-essential cmake telnet && \
    apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps ca-certificates p11-kit wget bzip2 git mercurial subversion && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh

COPY bin/pysetup.sh /pysetup.sh 
RUN chmod a+x /pysetup.sh && ./pysetup.sh
RUN conda install -c conda-forge --yes mamba && \
    mamba install --yes python==3.8.6 && \
    pip install --upgrade pip setuptools && \
    mamba install --yes numpy==1.19.2 pandas cytoolz numba lz4 scikit-build python-blosc=1.9.2 && \
    (mamba install --yes pyarrow ||  pip install -vvv pyarrow)


COPY python/pyspark ${SPARK_HOME}/python/pyspark
COPY python/lib ${SPARK_HOME}/python/lib

WORKDIR /opt/spark/work-dir
ENTRYPOINT [ "/opt/entrypoint.sh" ]

# Specify the User that the actual main process will run as
ARG spark_uid=185
USER ${spark_uid}

You’ll see this file references pysetup.sh which installs Python using Miniforge so we can support arm as shown in Setup python.

Example 5. Setup python
#!/bin/bash

set -ex
export arch=$(uname -m)
if [ "$arch" == "aarm64" ]; then
  arch="arm64";
fi
wget --quiet https://github.com/conda-forge/miniforge/releases/download/4.8.5-1/Miniforge3-4.8.5-1-Linux-${arch}.sh -O ~/miniforge.sh
chmod a+x ~/miniforge.sh
~/miniforge.sh -b -p /opt/conda
/opt/conda/bin/conda clean -tipsy
ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh
echo ". /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc
echo "conda activate base" >> ~/.bashrc
echo ". /opt/conda/etc/profile.d/conda.sh" >> /etc/bash.bashrc
echo "conda activate base" >> /etc/bash.bashrc
source ~/.bashrc
find /opt/conda/ -follow -type f -name '*.a' -delete
find /opt/conda/ -follow -type f -name '*.js.map' -delete
/opt/conda/bin/conda clean -afy
/opt/conda/bin/conda install --yes nomkl cytoolz cmake tini
/opt/conda/bin/conda init bash
/opt/conda/bin/conda install --yes mamba

You will want to make your Dockerfile install the dependencies for your program while making sure to select the same version of Python that you have in your Jupyter container, so you may need to modify those two examples.

Once you’ve configured your enviroment, you can build your Spark image using the docker-image-tool that ships with Spark as shown in [build_exec_containers].

Example 6. Build the exec containers
# Copy over python setup script so we can have matching pythons
cp pysetup.sh ./spark-3.0.1-bin-hadoop3.2/bin/
pushd spark-3.0.1-bin-hadoop3.2
SPARK_HOME=`pwd`
SPARK_ROOT="$SPARK_HOME"
./bin/docker-image-tool.sh  -r holdenk -t v3.0.1.2 -X -b java_image_tag=11-jre-slim -p PyDockerfile Dockerfile build
Warning

Some parts of Spark may assume a specific layout of the container, e.g. in Spark 3.1 the decommissioning integration makes certain assumptions, so be careful when making changes.

Setting up Kubernetes Permissions

The driver program needs the ability to launch new pods for executors. To allow launching, create a service account or give permissions to the default service account (SA) . In my case, I decided to add permissions to the "dask" service account since the JupyterHub launcher (covered later) doesn’t support different service accounts depending on the notebook. I also created a special "spark" namespace to make it easier to watch what was happening. My namespace and SA setup is shown in Setup up namespace and service account.

Example 7. Setup up namespace and service account
kubectl create serviceaccount -n jhub spark
kubectl create namespace spark
kubectl create rolebinding spark-role --clusterrole=edit --serviceaccount=jhub:spark --namespace=spark
# We can't override SA in the launcher on per-container basis, so since I've already got a dask SA.
kubectl create rolebinding spark-role-to-dask-acc --clusterrole=edit --serviceaccount=jhub:dask --namespace=spark

Creating a Service (Allowing Driver-Executor Communication)

Spark depends on the executors connecting back to the driver for both the driver its self and the driver’s BlockManager. If your driver is in a different namespace, the easiest way to allow communication is to create a service to let the executors connect to the driver.

Example 8. The Spark Driver Service
apiVersion: v1
kind: Service
metadata:
  name: driver-service
spec:
  selector:
    app: jupyterhub
    component: singleuser-server
  ports:
    - name: driver
      protocol: TCP
      port: 2222
      targetPort: 2222
    - name: blockmanager
      protocol: TCP
      port: 7777
      targetPort: 7777
Example 9. Apply the Spark Driver Service
kubectl create service -n jhub driver-service.yaml

These port numbers are arbitrary (you can pick different ones), but you’ll need to remember them when configuring your SparkContext.

Configuring Your JupyterHub Launcher

Now that you have all of the foundational components set up, it’s time to add them to your JupyterHub launcher. I did this by adding the Spark 3.0.1 option to the profileList in my config.yaml shown in [my-jupyter-config].

Example 10. Combined Spark and Dask Jupyter Config
hub:
  image:
    name: holdenk/jupyter-hub-magichub
proxy:
  service:
    type: NodePort
  secretToken: DIFFERENTSECRET
  secretSync:
    image:
      name: holdenk/jupyter-hub-magicsecret-sync
      tag: '0.10.2'
  chp:
    image:
      name: holdenk/jconfigurable-http-proxy
      tag: '0.0.1'
ingress:
  enabled: true
  hosts:
    - holdenkarau.mooo.com
  tls:
   - hosts:
      - holdenkarau.mooo.com
     secretName: k3s-mooo-tls
singleuser:
  serviceAccountName: dask
  networkTools:
    image:
      name: holdenk/jupyter-hub-magicnetwork-tools
      tag: '0.10.2'
  image:
    name: holdenk/jupyter-hub-magicsingleuser-sample
    tag: '0.10.2'
  profileList:
    - display_name: "Minimal environment"
      description: "To avoid too much bells and whistles: Python."
      default: true
    - display_name: "Dask container"
      description: "If you want to run dask"
      kubespawner_override:
        image: holdenk/dask-notebook:v0.9.4b
    - display_name: "Spark 3.0.1 container"
      description: "If you want to run Spark"
      kubespawner_override:
        image: holdenk/spark-notebook:v3.0.1.1
prePuller:
  hook:
    image:
      name: holdenk/jupyter-hub-magicimage-awaiter
      tag: '0.10.2'
# Do something better here! It's being reworked though - https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1871
auth:
  type: dummy
  dummy:
    password: 'mypassword'
  whitelist:
    users:
      - user1
      - user2

You can then upgrade your previous deployment with helm upgrade --cleanup-on-fail --install $RELEASE jupyterhub/jupyterhub --namespace $NAMESPACE --create-namespace --version=0.10.2 --values config.yaml.

Configuring Your SpakContext

Now that you can launch a notebook with everything needed for Spark, it’s time to talk about configuring your SparkContext to work in this environment. You’ll need more configuration than you can get through the SparkContext constructor directly, so you will also need to import the SparkConf. Your imports might look like Spark Imports.

Example 11. Spark Imports
from pyspark import *
from pyspark.context import *
from pyspark.conf import *

In my cluster, the K8s API is available at https://kubernetes.default, so I start my configuration as in Start of Spark Conf.

Example 12. Start of Spark Conf
conf = (SparkConf().setMaster("k8s://https://kubernetes.default")

Since there are no pre-built docker images for Spark, you’ll need to configure the container image used for the executor, mine is shown in Configure Container.

Example 13. Configure Container
    .set("spark.kubernetes.container.image", "holdenk/spark-py:v3.0.1.2")

Normally Spark assigns ports randomly for things like the driver and the block manager, but we need to configure Spark to bind to the correct ports, and also have the executors connect to the service we’ve created instead of trying to connect back to the hostname of the driver. My service configuration is shown in Spark Network Conf.

Example 14. Spark Network Conf
    .set("spark.driver.port", "2222") # Needs to match svc
    .set("spark.driver.blockManager.port", "7777")
    .set("spark.driver.host", "driver-service.jhub.svc.cluster.local") # Needs to match svc
    .set("spark.driver.bindAddress", "0.0.0.0") #  Otherwise tries to bind to svc IP, will fail

In addition to that, you’ll need to tell Spark which namespace it has permission to create executors in, shown in Spark Namespace Conf.

Example 15. Spark Namespace Conf
    .set("spark.kubernetes.namespace", "spark")

While it’s not essential, configuring an application name makes debugging much easier. You can do this with .set("spark.app.name", "PySparkHelloWorldInsideTheCluster").

Conclusion

The process of adding a Spark notebook to your JupyterHub launcher is a little more involved than it is for typical notebooks because of the required permissions and network connections. Moving inside the cluster from outside of the cluster can offer many advantages, especially if your connection to the cluster goes over the internet. If you aren’t familiar with Spark, there is a new version of Learning Spark by my former co-workers (or you can buy the old one I co-wrote, but it’s pretty out of date), along with Rachel & my High Performance Spark. Up next, I’m planning on deploying Ray on the cluster, then jumping back to Dask and with the GitHub and BitCoin data.