Scaling Python ML -- blog of my adventures working with different tools for scaling Python ML workloads.

= Initial Steps at Getting polynote and ZeroToJupyterHub to work together (ish)
2021-07-06
JupyterHub has long been the de facto leader in the notebook space. Polynote is a new approach to notebooks, with multi-language kernels and a focus on first-class Scala support. There is a lot of tooling to make deploying & managing JupyterHub easy in various environments; since Polynote is relatively new, the same tooling does not yet exist. Since I am lazy, and I wanted a great notebook for Spark (w/ Scala & Python), I figured I'd see if I could get my ZeroToJupyterHub deployment to launch a Polynote notebook. This way I can have multi-user Kubernetes deployments of Polynote alongside the other kernels I'm using (Python w/ Dask, Python w/ Ray, etc.). It turns out it is possible, and of course there are some things I learned along the way: despite my initial thought that this would "take an hour or so", it ended up being an eight-part stream, and I still don't have security set up.
The first step I did was creating a new Dockerfile for JupyterHub to launch so that we could put any customizations needed inside of it. We could view this Dockerfile as "foreshadowing" for the other changes I'm going to describe:
[source, dockerfile]
----
FROM holdenk/polynote:dev
# Being root again
USER root
# A script to ignore everything we tell it. I'm sure that's useful
COPY scripts/jupyter-fake-launch.sh ./polynote/
# Copy config file
COPY --chown=${NB_USER}:${NB_USER} docker/jupyter/config.yml /config.yml
# Back to being a safe-ish user
USER ${NB_USER}
# Use our custom launch script
ENTRYPOINT ["./polynote/jupyter-fake-launch.sh"]
----
Polynote is configured by a config.yml file, and ZeroToJupyterHub mounts the user storage at /home/jovyan & expects the server to be listening on port 8888 (instead of Polynote's default of 8192). So I added a config.yml of:
[source, yaml]
----
listen:
  host: 0.0.0.0
  port: 8888
storage:
  dir: /home/jovyan
ZeroToJupyterHub is designed to launch JupyterHub containers and so it passes some command-line arguments as expected by JupyterHub. The simplest option I could think of was to create a script that ignored most of the information from JupyterHub and reformatted the other arguments as needed.
[source,bash]
----
#!/usr/bin/env bash
set -ex
# Ignore everything the Jupyter launcher tells us
cp -r /opt/notebooks/examples /home/jovyan/ || echo "Examples already present"
# Use the JUPYTERHUB_SERVICE_PREFIX as the base_uri since we need to match the reverse proxy set up by zero-to-jupyterhub.
echo "
ui:
base_uri: $JUPYTERHUB_SERVICE_PREFIX
" >> /config.yml
./polynote/polynote.py --config /config.yml
----
Unfortunately, while Polynote has support for being served by reverse proxies at sub-URLs, it assumes that the reverse proxy always rewrites the requests to be relative to "/". Fixing this required adding a new function (normalizePath), which was relatively straightforward, although I did screw up putting it in all of the right places, which led to a lot of confusion (and is honestly part of why it was an eight-part stream).
[source, scala]
----
def normalizePath(path: String): String = {
  path.stripPrefix(userPrefix)
}
----
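For illustration only, the same normalization sketched in Python; `user_prefix` stands in for Polynote's userPrefix value (e.g. the `/user/<name>` prefix a JupyterHub reverse proxy leaves on incoming paths):

```python
# Sketch of the same logic in Python; user_prefix is a stand-in for the
# prefix the reverse proxy leaves on incoming request paths.
def normalize_path(path: str, user_prefix: str) -> str:
    # Strip the prefix only when present, mirroring Scala's stripPrefix.
    if path.startswith(user_prefix):
        return path[len(user_prefix):]
    return path
```

For example, `normalize_path("/user/holden/notebook/foo", "/user/holden")` returns `"/notebook/foo"`, while a path without the prefix passes through unchanged.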
All in all the (not very nice, but serviceable) integration is on my https://github.com/holdenk/polynote/tree/integrate-with-zero-to-jupyterhub[GitHub Polynote integrate-with-zero-to-jupyterhub branch].
In addition, while I was working on this, I found various issues with the Polynote Dockerfile and build instructions not being in sync. This is pretty normal, but since I was new to the project I took several missteps; I've submitted some PRs (including some already merged) to clarify the issues for others.
While it's possible to launch the Polynote notebooks, there is still a lot of work to be done. First, there is no security once the notebook is launched; second, the traditional JupyterHub header should be included so the user can stop & restart their Polynote environment as desired.

= Setting up Per User Secrets and other customizations with Jupyterhub on Kubernetes
2021-06-30
While we've already set up JupyterHub using zero-to-jupyterhub on Kubernetes, I wanted to expand access to my other prospective co-authors as part of trying to convince them to write with me (again) :). While I generally trust my co-authors, I still like having some kind of access controls so we don't accidentally stomp on each other's work.
In Kubernetes the main way of doing this is with service accounts and secrets.
While we can configure this globally, configuring this per-user needs a bit of custom code.
[source, yaml]
----
include::../subrepos/scalingpythonml/jupyter/multiuser.yaml[]
----
Most of the magic is inside `preSpawnHook`. This also uses the `z2jh` library to allow it to load the config in `custom.users` in the YAML file. You could also point this to a database or something else, but given I've got about three users I figured in-line YAML was good enough for my case.
I'd like to thank https://github.com/consideRatio[consideRatio] for all his help; to be clear, any mistakes are my own fault.

= A Quick Look at DF I/O (basic ETL w/JSON over http to CSV and Parquet on MinIO) in Dask
2021-06-29
Now that we've got Dask installed, it's time to try some simple data preparation and extract, transform, load (ETL). While ETL is often not the most exciting thing, getting data is the first step of most adventures. Data tools don't exist in a vacuum; the data normally comes from somewhere else, and the data or models we make need to be usable with other tools. Because of this, the formats and systems that a tool can interact with can make the difference between it being a fit or needing to keep looking. To simplify your life with I/O, you should make sure your notebook (or client) runs inside the same cluster as the workers.
For now, we'll start by taking all of the GitHub activity https://www.gharchive.org/[from gharchive] and re-partitioning it in a way that will allow us to try and train models on a per-organization and per-repo basis.
== File Systems (e.g., Data Stores, Sinks, and well File Systems)
Often, the need to scale our Python programs comes at least in part from larger input sizes. When we use distributed systems (like Kubernetes), the data must be accessible to all workers. For this reason, we end up needing to get our data over the network. This does not have to be what one would traditionally think of as a network file system (like, say, NFS or AFS); it can include things such as HTTP, S3, HDFS, etc. All of these protocols expose some common file-like access.
Dask's file access layer uses https://github.com/intake/filesystem_spec[FSSPEC], from the https://intake.readthedocs.io/en/latest/[intake project], to access the different file systems. Since FSSPEC supports such a range of file systems, it does not install the requirements for every supported file system. You can see what file systems are supported, and which ones need additional packages by running:
[source, python]
----
include::../subrepos/scalingpythonml/dask-examples/Dask - Explore S3MinIO.py[tags=known_fs]
----
In my case, the known implementations returned:
[source]
----
{'file': {'class': 'fsspec.implementations.local.LocalFileSystem'},
'memory': {'class': 'fsspec.implementations.memory.MemoryFileSystem'},
'dropbox': {'class': 'dropboxdrivefs.DropboxDriveFileSystem',
'err': 'DropboxFileSystem requires "dropboxdrivefs","requests" and "dropbox" to be installed'},
'http': {'class': 'fsspec.implementations.http.HTTPFileSystem',
'err': 'HTTPFileSystem requires "requests" and "aiohttp" to be installed'},
'https': {'class': 'fsspec.implementations.http.HTTPFileSystem',
'err': 'HTTPFileSystem requires "requests" and "aiohttp" to be installed'},
'zip': {'class': 'fsspec.implementations.zip.ZipFileSystem'},
'gcs': {'class': 'gcsfs.GCSFileSystem',
'err': 'Please install gcsfs to access Google Storage'},
'gs': {'class': 'gcsfs.GCSFileSystem',
'err': 'Please install gcsfs to access Google Storage'},
'gdrive': {'class': 'gdrivefs.GoogleDriveFileSystem',
'err': 'Please install gdrivefs for access to Google Drive'},
'sftp': {'class': 'fsspec.implementations.sftp.SFTPFileSystem',
'err': 'SFTPFileSystem requires "paramiko" to be installed'},
'ssh': {'class': 'fsspec.implementations.sftp.SFTPFileSystem',
'err': 'SFTPFileSystem requires "paramiko" to be installed'},
'ftp': {'class': 'fsspec.implementations.ftp.FTPFileSystem'},
'hdfs': {'class': 'fsspec.implementations.hdfs.PyArrowHDFS',
'err': 'pyarrow and local java libraries required for HDFS'},
'webhdfs': {'class': 'fsspec.implementations.webhdfs.WebHDFS',
'err': 'webHDFS access requires "requests" to be installed'},
's3': {'class': 's3fs.S3FileSystem', 'err': 'Install s3fs to access S3'},
'adl': {'class': 'adlfs.AzureDatalakeFileSystem',
'err': 'Install adlfs to access Azure Datalake Gen1'},
'abfs': {'class': 'adlfs.AzureBlobFileSystem',
'err': 'Install adlfs to access Azure Datalake Gen2 and Azure Blob Storage'},
'az': {'class': 'adlfs.AzureBlobFileSystem',
'err': 'Install adlfs to access Azure Datalake Gen2 and Azure Blob Storage'},
'cached': {'class': 'fsspec.implementations.cached.CachingFileSystem'},
'blockcache': {'class': 'fsspec.implementations.cached.CachingFileSystem'},
'filecache': {'class': 'fsspec.implementations.cached.WholeFileCacheFileSystem'},
'simplecache': {'class': 'fsspec.implementations.cached.SimpleCacheFileSystem'},
'dask': {'class': 'fsspec.implementations.dask.DaskWorkerFileSystem',
'err': 'Install dask distributed to access worker file system'},
'github': {'class': 'fsspec.implementations.github.GithubFileSystem',
'err': 'Install the requests package to use the github FS'},
'git': {'class': 'fsspec.implementations.git.GitFileSystem',
'err': 'Install pygit2 to browse local git repos'},
'smb': {'class': 'fsspec.implementations.smb.SMBFileSystem',
'err': 'SMB requires "smbprotocol" or "smbprotocol[kerberos]" installed'},
'jupyter': {'class': 'fsspec.implementations.jupyter.JupyterFileSystem',
'err': 'Jupyter FS requires requests to be installed'},
'jlab': {'class': 'fsspec.implementations.jupyter.JupyterFileSystem',
'err': 'Jupyter FS requires requests to be installed'}}
----
If you don't see your file system supported, there are a few options, ranging from writing a new spec for FSSPEC, to using a FUSE filesystem layer, to copying the data to a supported file system.
Since I'm focused on experimentation, I decided to install all of the extra packages from https://github.com/intake/filesystem_spec/blob/master/setup.py[https://github.com/intake/filesystem_spec/blob/master/setup.py] as part of my Dockerfile in <<ex_install_all_fs>>. If we just wanted to support our example (reading from http and writing to S3 compatible FS) we could simplify that to <<ex_install_just_s3_http>>.
[[ex_install_just_s3_http]]
====
[source, bash]
----
pip install fsspec[s3] aiohttp
----
====
Often, distributed file systems require some level of configuration, although sometimes this configuration is "hidden" from the end user, making it less visible. With Dask, the configuration needs to be specified along with each read/write operation, which makes it more visible.
[NOTE]
====
In Hadoop-based systems, configuration is often read from a combination of environment variables and mystery XML files, which, when working, can feel like magic -- but keep in mind, the most difficult configuration to debug is the configuration you can not find.
====
Since we're pulling our data over public HTTP, we don't need special configuration for that. However, for the write side, I'm using MinIO (an S3-compatible object store), which needs configuration. The endpoint_url is the service name from `helm ls -n minio` plus [namespace].svc.cluster.local. The key and secret are specified during the install (which we did in the previous post).
[[minio_options]]
====
[source, python]
----
include::../subrepos/scalingpythonml/dask-examples/Dask - Explore S3MinIO.py[tags=minio_storage_options]
----
====
We'll use these storage options in the next section when writing data to our MinIO server.
[WARNING]
====
The first time I did this, I was unable to figure out what was going on for a few hours because I had `"anon": "false"`, and the string "false" was automatically converted to the boolean `true` (non-empty strings are truthy).
====
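To make that boolean pitfall concrete, here is a hedged sketch of what such storage options can look like; the key, secret, and endpoint are placeholders for my install, not real values:

```python
# Placeholder MinIO storage options for Dask/fsspec (s3fs-style) access.
# key/secret/endpoint are illustrative; substitute your install's values.
minio_storage_options = {
    "key": "YOURACCESSKEY",
    "secret": "YOURSECRETKEY",
    "client_kwargs": {
        # service name + namespace.svc.cluster.local (MinIO's default port 9000)
        "endpoint_url": "http://minio.minio.svc.cluster.local:9000",
    },
    # Must be a real boolean -- the string "false" is truthy!
    "anon": False,
}
```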
Sometimes data can also come from or be written to things that are even less like file systems than the web, such as databases. In Dask, these are represented in a way closer to how file formats are represented, which is what we are going to explore next.
== (File) Formats
Dask has built in support for a variety of formats on top of the different file systems. Both the Bag and DataFrame APIs have their own IO functions (https://docs.dask.org/en/latest/bag-creation.html[bag IO] & https://docs.dask.org/en/latest/dataframe-api.html#create-dataframes[dataframe IO]).
In our case, the input format we've got is JSON and the target output format is Parquet. Dask DataFrame's IO library supports both of those formats so we'll use the DataFrame API. We could also do this with the Bag API.
=== Reading
To load the data we need to specify the files we want to load. On file systems that support listing (like S3, HDFS, local, etc.), we can use wild cards, but when using a file system without listing support we need to create a list of all of the files.
[source, python]
----
include::../subrepos/scalingpythonml/dask-examples/Dask - Explore S3MinIO.py[tags=make_file_list]
----
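Since gharchive is served over plain HTTP with no listing support, building such a list can be sketched as below (assuming gharchive's hourly file naming, `YYYY-MM-DD-H.json.gz`, and an arbitrary example date):

```python
# Build the list of hourly gharchive files for a single day; HTTP has no
# directory listing, so we enumerate the file names ourselves.
fnames = [
    f"https://data.gharchive.org/2021-01-01-{hour}.json.gz"
    for hour in range(24)
]
```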
When I have a number of different inputs, I like to start with loading just the first file to explore the schema.
[source, python]
----
include::../subrepos/scalingpythonml/dask-examples/Dask - Explore S3MinIO.py[tags=load_data]
----
After loading our initial input, calling "head" on the distributed DataFrame lets us see what's going on.
[source, python]
----
df.head()
----
Note the result of doing this (in IPython/Jupyter) is displayed using the normal pandas display logic, resulting in a nice image <<dfheadimg>>.
[[dfheadimg]]
image::/images/a-quick-look-at-df-io-basic-etl-w-json-over-http-to-csv-and-parquet-on-minio-df-head.png[Image of Dataframe display in the notebook]
If you've called `df.head` in Spark, you'll note this is a much nicer default view. That being said the data needs a bit of cleaning up.
=== Some Quick Tidying Up
As we can see, there is nested JSON data in the DataFrame. I would like to partition on the project name so that, later, we can play around with data per-project without having to load everything (although I don't think there is any automated filter push down). However, we can't partition using a column that is a nested data structure, so we need to extract the project name.
[source, python]
----
include::../subrepos/scalingpythonml/dask-examples/Dask - Explore S3MinIO.py[tags=cleanup]
----
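Conceptually, the cleanup pulls a flat string out of a nested record. A plain-Python sketch of the per-row logic (the `repo` field layout follows the GitHub event schema; the helper names are mine, not Dask API):

```python
# Each GitHub event carries a nested repo record like
# {"id": ..., "name": "org/repo"}; extract the flat name so it can be
# used as a partition column, and the org for per-org grouping.
def extract_repo_name(event):
    return event["repo"]["name"]

def extract_org(event):
    # The org is the part before the slash in "org/repo".
    return extract_repo_name(event).split("/")[0]
```

With Dask this logic runs per row across the DataFrame, producing ordinary string columns we can then `partition_on`.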
=== Writes
The write side looks very similar to the read side, but we're going to use the minio_storage_options object we created earlier.
[source, python]
----
include::../subrepos/scalingpythonml/dask-examples/Dask - Explore S3MinIO.py[tags=write]
----
Not all of the Dask formats support partitioned writes. When a format does not support partition_on or other partitioned writes, Dask will need to collect all of the data back to either a single executor or the client Python process. This can cause failures with large datasets.
=== Compression
Data is often stored in compressed formats, and the same library used to abstract file system access in Dask also abstracts compression. Some compression algorithms support random reads, but many do not. For people coming from the Hadoop ecosystem, this can be thought of as the impact on being "splittable."
Just because the underlying compression algorithm may support random reads does not mean that the FSSPEC wrapper will. Unfortunately, there is no current, easy way to check what a compression format supports besides testing it out or reading the source code.
[WARNING]
====
Dask does not support "streaming" non-random access input formats. This means that the data inside a file must be able to fit entirely in memory.
====
== Conclusion
Dask I/O integrates pretty well into much of the existing "big data" ecosystem, although the methods of specifying configuration are a little bit different. Some nested data structures can be difficult to represent in certain formats with Dask, although as the Python libraries for these formats continue to improve, so will Dask's support.

= Tagging my ARM NVidia Jetson machines with GPUs in my Kubernetes (k3s) cluster
2021-02-22
We previously set up the cluster with GPUs, but since we want to get into using GPU acceleration, it's important to be able to request these resources from the Kubernetes scheduler. Tagging the nodes will help Kubernetes tell the difference between the Raspberry Pis and Jetson Xaviers in <<my-cluster-img>>. Depending on your type of GPUs, there are options for automatically labeling your nodes. Since I've got NVidia boards, we'll use the k8s-device-plugin to do the labeling of the nodes. For our machines, though, the containers do not run on ARM out of the box, and the code does not detect Jetson chips.
[[my-cluster-img]]
image::/images/pi-and-jetson-IMG_0629.jpg[Image of Raspberry Pis on top with Jetson Xaviers below]
== Disabling unneeded checks
The Wind River folks have link:$$https://blogs.windriver.com/wind_river_blog/2020/06/nvidia-k8s-device-plugin-for-wind-river-linux/$$[published a series of patches for NVidia] and instructions. You can apply these patches by running the commands in <<patch_ex>>.
.Apply the Wind River patches to the ARM tagging.
[[patch_ex]]
====
[source, bash]
----
patch -p1 < 0001-arm64-add-support-for-arm64-architectures.patch
patch -p1 < 0002-nvidia-Add-support-for-tegra-boards.patch
----
====
The third patch doesn't quite apply cleanly, as the NVML loading code has changed a bit (namely, failOnInitErrorFlag has been added). However, if you take a look at the patch, you can apply it manually by looking for the `log.Println("Loading NVML")` statement and replacing that chunk of code with the new code (indicated by the +s in the patch file).
== Building the arm image
Building the image with ARM support is now relatively simple. If you're running on an ARM machine you can just build as normal, e.g. `docker build -t holdenk/k8s-device-plugin-arm:v0.7.0.1 -f ./docker/arm64/Dockerfile.ubuntu16.04 .`
Otherwise, assuming you've set up cross-building, you can use buildx and just specify one platform, `docker buildx build -t holdenk/k8s-device-plugin-arm:v0.7.0.1 --platform linux/arm64 --push -f ./docker/arm64/Dockerfile.ubuntu16.04 .`
== Building a multi-arch image
If you have a mix of ARM and x86 machines in your cluster (as I do), having just an ARM image makes deployment a bit difficult. Thankfully we can update the Dockerfile to make it multi-arch by adding the `ARG TARGETARCH` and taking out the hardcoded arm64 references. For the `--platform=arm64` we can just go ahead and remove them, and for the wget, we can replace `arm64` with `${TARGETARCH}`.
Now, assuming you've got your multi-arch Docker build environment set up, you can cross-build this with, `docker buildx build -t holdenk/k8s-device-plugin:v0.7.0.1 --platform linux/arm64,linux/amd64 --push -f ./docker/multi/Dockerfile.ubuntu16.04 .`.
== Updating the YAML & Deploying
Once you've built your container image, you'll need to update the image in `nvidia-device-plugin.yml` to point to your custom version. You can then deploy it with `kubectl apply -f nvidia-device-plugin.yml`.
== Conclusion
Tagging your nodes with GPU resources is an important part of being able to take advantage of your cluster resources. While the NVidia tagger does not yet support Jetson boards out of the box, there are only a few small patches needed to get it working.

= Running Spark Jupyter Notebooks Client Mode inside of a Kubernetes Cluster (with ARM for Extra Fun)
2020-12-21
Having your Spark Notebook inside the same cluster as the executors can reduce network errors and improve uptime. Since these network issues can result in job failure, this is an important consideration. This post assumes that you've already set up the foundation JupyterHub inside of Kubernetes deployment; link:$$https://scalingpythonml.com/2020/12/12/deploying-jupyter-lab-notebook-for-dask-on-arm-on-k8s.html$$[the Dask-distributed notebook blog post covers that if you haven't].
I like to think of it this way: washing my dog (Timbit) is a lot easier inside of the bathtub than trying to wash him outside, although it can take a bit of work to get him into the tub <<timbit-tub-img>>.
[[timbit-tub-img]]
image::/images/timbit-in-the-tub-IMG_0589.jpg[Timbit in the bath tub]
If you're interested my link:$$https://www.youtube.com/watch?v=a7hDZxisuAk&list=PLRLebp9QyZtapJnz4cpDctnQ1i_qUmeap&index=1$$[YouTube playlist of Get Spark Working with Notebook inside my Kubernetes (K8s/K3s) ARM cluster ] shows the journey I went on to get this working.
A lot of my blog posts come out of my link:$$https://www.youtube.com/user/holdenkarau$$[Open Source Live Streams] (which even include Timbit sometimes).
To get a Spark notebook working inside of the cluster, we need to set up a few different things. The first step, similar to dask-kubernetes, is building a container with Jupyter and Spark installed. We also need to make a container of Spark for the executors. In addition to the containers, we need to set up permissions on the cluster and ensure that the executors that your Spark driver will launch have a way to talk to the driver in the notebook.
[NOTE]
====
It may seem like there are extra steps here compared to dask-kubernetes. Dask-kubernetes automates some service creation, which allows for communication between the scheduler, executors, and the notebook.
====
== Building the Containers
We need two containers, one with Jupyter and Spark installed together and another with just Spark. Since we're working in Python, there are some extra Python libraries we want to install as well (PyArrow, pandas, etc.) If you've got a specific version of a library that your project depends on, you'll want to add it to both the Jupyter Spark driver container and the executor containers.
To start with we'll download link:$$http://spark.apache.org/$$[Apache Spark] and decompress it, as shown in <<dlspark>>, so that we can copy the desired parts inside our containers.
.Download Spark
[[dlspark]]
====
[source, bash]
----
include::../subrepos/scalingpythonml/spark/containers/build.sh[tags=dlspark]
----
====
Now that we have Spark downloaded we can start customizing our `Dockerfiles`.
=== Building the Jupyter Spark Container
The easiest way to build a Jupyter Spark container is to install Spark on top of the base Jupyter container. If you're running on ARM, you'll need to first cross-build the base Jupyter container (see my link:$$https://scalingpythonml.com/2020/12/12/deploying-jupyter-lab-notebook-for-dask-on-arm-on-k8s.html$$[instructions in the previous post]).
In my case I've custom built the link:$$https://github.com/jupyterhub/zero-to-jupyterhub-k8s/tree/master/images/singleuser-sample$$[single-user sample Docker container] from zero-to-jupyterhub-k8s to `holdenk/jupyter-hub-magicsingleuser-sample:0.10.2` as I needed ARM support. If you don't need to cross-build your custom container, you can use the pre-built container at `jupyterhub/k8s-singleuser-sample` as the basis for yours.
Since Spark needs Java to run, I decided to look at the link:$$https://github.com/docker-library/openjdk/blob/master/11/jdk/slim-buster/Dockerfile$$[JDK 11 slim Dockerfile] to see how to install Java in a Dockerfile well. If you're an object-oriented person, you might wish we had multiple inheritance for Dockerfiles, but that doesn't work. In addition to the JDK 11 Dockerfile, I looked at Spark's own Dockerfiles (including PySpark's), and the resulting Jupyter Spark container specification is shown in <<spark_notebook_dockerfile>>.
.Dockerfile to add Spark on top of the Jupyter Notebook container.
[[spark_notebook_dockerfile]]
====
[source, dockerfile]
----
include::../subrepos/scalingpythonml/spark/containers/notebook/Dockerfile[]
----
====
Since the Dockerfile copies parts of Spark in, remember to save it at the root of where you decompressed the Spark tarball.
If you're not cross-building, you can build this with a regular `docker build`; in my case, since I'm targeting ARM and x86, I built it as shown in <<build_spark_nb>>.
.Build Spark notebook container
[[build_spark_nb]]
====
[source, bash]
----
include::../subrepos/scalingpythonml/spark/containers/build.sh[tags=build-notebook]
----
====
[NOTE]
====
An alternative would have been to take the JDK-11 containers as a starting point, and install Jupyter on top of it, but when I tried that I found it more complicated.
====
This gives us a container with both Spark and the base notebook layer together. For the executors, we don't want to bother shipping Jupyter, so we'll build a separate container for the executors.
=== Building the Executor Container
Spark does not ship pre-built containers for its executors, so regardless of which arch you’re using, you will need to build the executor containers.
If you're building multi-arch containers, you will need to update Spark's docker-image-tool: change the buildx invocations to push the images by adding "--push" to the docker buildx commands in ./bin/docker-image-tool.sh.
Spark's Python container Dockerfile installs an older version of Python without any dependencies, so you will want to customize your Python container setup, as well. My Dockerfile is shown in <<spark_exec_dockerfile>>.
.Dockerfile customizing PySpark setup
[[spark_exec_dockerfile]]
====
[source, dockerfile]
----
include::../subrepos/scalingpythonml/spark/containers/python-executor/Dockerfile[]
----
====
You'll see this file references `pysetup.sh`, which installs Python using Miniforge so we can support ARM, as shown in <<pysetupsh>>.
.Setup python
[[pysetupsh]]
====
[source, bash]
----
include::../subrepos/scalingpythonml/spark/containers/pysetup.sh[]
----
====
You will want to make your Dockerfile install the dependencies for your program, while making sure to select the same version of Python that you have in your Jupyter container, so you may need to modify those two examples.
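A quick sanity check: run the same snippet in both the notebook container and the executor container and compare the output, since PySpark requires the driver and worker Python major.minor versions to match:

```python
import sys

# Print the interpreter's major.minor version; run this in both the
# Jupyter container and the executor container and compare.
python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
print(python_version)
```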
Once you've configured your environment, you can build your Spark image using the `docker-image-tool` that ships with Spark, as shown in <<build_exec_containers>>.
.Build the exec containers
[[build_exec_containers]]
====
[source, bash]
----
include::../subrepos/scalingpythonml/spark/containers/build.sh[tags=build_exec_containers]
----
====
[WARNING]
====
Some parts of Spark may assume a specific layout of the container, e.g. in Spark 3.1 the decommissioning integration makes certain assumptions, so be careful when making changes.
====
== Setting up Kubernetes Permissions
The driver program needs the ability to launch new pods for executors. To allow launching, create a service account or give permissions to the default service account (SA). In my case, I decided to add permissions to the "dask" service account, since the JupyterHub launcher (covered later) doesn't support different service accounts depending on the notebook. I also created a special "spark" namespace to make it easier to watch what was happening. My namespace and SA setup is shown in <<setupsa>>.
.Setup up namespace and service account
[[setupsa]]
====
[source, bash]
----
include::../subrepos/scalingpythonml/spark/setup_notebook.sh[tags=setupsa]
----
====
== Creating a Service (Allowing Driver-Executor Communication)
Spark depends on the executors connecting back to the driver, both for the driver itself and for the driver's BlockManager. If your driver is in a different namespace, the easiest way to allow communication is to create a service to let the executors connect to the driver.
.The Spark Driver Service
[[driver_svc]]
====
[source, bash]
----
include::../subrepos/scalingpythonml/spark/driver-service.yaml[]
----
====
.Apply the Spark Driver Service
[[driver_svc_apply]]
====
[source, bash]
----
include::../subrepos/scalingpythonml/spark/setup_notebook.sh[tags=setup_service]
----
====
These port numbers are arbitrary (you can pick different ones), but you'll need to remember them when configuring your SparkContext.
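To make that concrete, a driver service along these lines (with hypothetical port numbers -- 2222 for the driver and 7777 for the block manager -- and placeholder labels and namespace) might look like:

[source, yaml]
----
# Sketch only: the selector must match your notebook/driver pod's labels,
# and the ports must match the ones you set in your SparkConf.
apiVersion: v1
kind: Service
metadata:
  name: driver-service
  namespace: jhub  # assumption: where the notebook pod runs
spec:
  selector:
    component: singleuser-server  # assumption: a label on the notebook pod
  ports:
    - name: driver
      port: 2222
      targetPort: 2222
    - name: blockmanager
      port: 7777
      targetPort: 7777
----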
== Configuring Your JupyterHub Launcher
Now that you have all of the foundational components set up, it's time to add them to your JupyterHub launcher. I did this by adding the `Spark 3.0.1` option to the `profileList` in my `config.yaml` shown in <<spark-jupyter-config>>.
.Combined Spark and Dask Jupyter Config
[[spark-jupyter-config]]
====
[source, yaml]
----
include::../subrepos/zero-to-jupyterhub-k8s/my-config.yaml[]
----
====
You can then upgrade your previous deployment with `helm upgrade --cleanup-on-fail --install $RELEASE jupyterhub/jupyterhub --namespace $NAMESPACE --create-namespace --version=0.10.2 --values config.yaml`.
== Configuring Your SparkContext
Now that you can launch a notebook with everything needed for Spark, it's time to talk about configuring your SparkContext to work in this environment. You'll need more configuration than the SparkContext constructor accepts directly, so you will also need to import SparkConf. Your imports might look like <<sparkImports>>.
.Spark Imports
[[sparkImports]]
====
[source, python]
----
include::../subrepos/scalingpythonml/spark/PySparkHelloWorldInsideTheCluster.py[tags=sparkImports]
----
====
In my cluster, the K8s API is available at `https://kubernetes.default`, so I start my configuration as in <<makeSparkConf>>.
.Start of Spark Conf
[[makeSparkConf]]
====
[source, python]
----
include::../subrepos/scalingpythonml/spark/PySparkHelloWorldInsideTheCluster.py[tags=makeSparkConf]
----
====
Since there are no pre-built Docker images for Spark, you'll need to configure the container image used for the executors; mine is shown in <<configContainer>>.
.Configure Container
[[configContainer]]
====
[source, python]
----
include::../subrepos/scalingpythonml/spark/PySparkHelloWorldInsideTheCluster.py[tags=configureContainer]
----
====
Normally, Spark assigns ports randomly for things like the driver and the block manager. Here, we need to configure Spark to bind to the correct ports and to have the executors connect to the service we've created instead of trying to connect back to the driver's hostname. My service configuration is shown in <<sparkNetConf>>.
.Spark Network Conf
[[sparkNetConf]]
====
[source, python]
----
include::../subrepos/scalingpythonml/spark/PySparkHelloWorldInsideTheCluster.py[tags=configureService]
----
====
In addition to that, you'll need to tell Spark which namespace it has permission to create executors in, shown in <<sparkNSConf>>.
.Spark Namespace Conf
[[sparkNSConf]]
====
[source, python]
----
include::../subrepos/scalingpythonml/spark/PySparkHelloWorldInsideTheCluster.py[tags=configureNamespace]
----
====
While it's not essential, configuring an application name makes debugging much easier. You can do this with `.set("spark.app.name", "PySparkHelloWorldInsideTheCluster")`.
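Pulling the settings above together, here is a sketch of the whole configuration as plain key/value pairs (which you could apply with `SparkConf().setAll(...)`). The image name, service hostname, and port numbers are illustrative placeholders, not the values from my actual setup:

```python
# Sketch of the Spark-on-Kubernetes settings discussed above.
# Every value here is a placeholder to swap for your own.
conf_pairs = {
    "spark.master": "k8s://https://kubernetes.default",
    "spark.app.name": "PySparkHelloWorldInsideTheCluster",
    # Executor container image (you must build your own, as above):
    "spark.kubernetes.container.image": "example/spark-py:v3.0.1",
    # Namespace the driver is allowed to create executor pods in:
    "spark.kubernetes.namespace": "spark",
    # Fixed ports plus the service hostname so executors can connect back:
    "spark.driver.port": "2222",
    "spark.blockManager.port": "7777",
    "spark.driver.host": "driver-service.jhub.svc.cluster.local",
    "spark.driver.bindAddress": "0.0.0.0",
}

# With pyspark installed, you would build the conf with:
#   conf = SparkConf().setAll(list(conf_pairs.items()))
for key, value in sorted(conf_pairs.items()):
    print(f"{key}={value}")
```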
== Conclusion
The process of adding a Spark notebook to your JupyterHub launcher is a little more involved than it is for typical notebooks because of the required permissions and network connections. Moving inside the cluster from outside of the cluster can offer many advantages, especially if your connection to the cluster goes over the internet. If you aren't familiar with Spark, there is a new version of link:$$https://amzn.to/2WxB1I1$$[_Learning Spark_] by my former co-workers (or you can buy the link:$$https://amzn.to/2Ww3s98$$[old one I co-wrote], but it's pretty out of date), along with Rachel & my link:$$https://amzn.to/3paoE0L$$[_High Performance Spark_]. Up next, I'm planning on deploying Ray on the cluster, then jumping back to Dask with the GitHub and Bitcoin data.

Having your Spark notebook inside the same cluster as the executors can reduce network errors and improve uptime. Since these network issues can result in job failure, this is an important consideration. This post assumes that you've already set up the foundational JupyterHub-on-Kubernetes deployment; the Dask-distributed notebook blog post covers that if you haven't.

Deploying Jupyter Lab/Notebook for Dask on ARM on Kubernetes (2020-12-12)
https://scalingpythonml.com/2020/12/12/deploying-jupyter-lab-notebook-for-dask-on-arm-on-k8s

= Deploying Jupyter Lab/Notebook for Dask on ARM on Kubernetes
In this post, we are going to go through how to deploy Jupyter Lab on ARM on Kubernetes. We'll also build a container for use with Dask, but you can skip/customize this step to meet your own needs. In the link:$$/2020/11/03/a-first-look-at-dask-on-arm-on-k8s.html$$[previous post, I got Dask on ARM on Kubernetes working], while using remote access to allow the Jupyter notebook to run outside of the cluster. After running into a few issues from having the client code outside of the cluster, I decided it was worth the effort to set up Jupyter on ARM on K8s.
== Rebuilding the JupyterHub Containers
The default Jupyter containers are not yet cross-built for ARM. If your primary development machine is not an ARM machine, you'll want to set up Docker buildx for cross-building, and I've got some instructions on how to do this.
[WARNING]
====
One of Jupyter's containers uses cgo to build a small bootstrap program: this program will not build under QEMU. If you get an error building your containers check out link:$$/2020/12/11/some-sharp-corners-with-docker-buildx.html$$[my instructions on cross-building with real hosts.] You can also cross-build without QEMU (discussed in the same post).
====
The JupyterHub project uses a special program called link:$$https://pypi.org/project/chartpress/$$[ChartPress] to build its images. ChartPress's compose-style build capabilities are similar to Docker's, but it is Python-focused. To make ChartPress use Docker buildx, you'll want to clone the repo (`git clone git@github.com:jupyterhub/chartpress.git`) and replace the following line in chartpress.py:
[source, diff]
----
- cmd = ['docker', 'build', '-t', image_spec, context_path]
+ cmd = ['docker', 'buildx', 'build', '-t', image_spec, context_path, "--platform", "linux/arm64,linux/amd64", "--push"]
----
Then you can pip install your local version:
[source, bash]
----
pip install -e .
----
Now that you have ChartPress set up to cross-build for ARM64 and AMD64, you can check out the link:$$https://github.com/jupyter/docker-stacks$$[docker-stacks repo] and make a few changes. First, the base-notebook container pins a platform-specific image hash, so we'll change the "FROM":
[source, diff]
----
-ARG ROOT_CONTAINER=ubuntu:focal-20200925@sha256:2e70e9c81838224b5311970dbf7ed16802fbfe19e7a70b3cbfa3d7522aa285b4
+#ARG ROOT_CONTAINER=ubuntu:focal-20200925@sha256:2e70e9c81838224b5311970dbf7ed16802fbfe19e7a70b3cbfa3d7522aa285b4
+ARG ROOT_CONTAINER=ubuntu:focal
----
Next, Miniconda doesn't have full ARM64 support, so you'll want to swap the Miniconda install for Miniforge:
[source, diff]
----
-RUN wget --quiet https://repo.continuum.io/miniconda/Miniconda3-py38_${MINICONDA_VERSION}-Linux-x86_64.sh && \
- echo "${miniconda_checksum} *Miniconda3-py38_${MINICONDA_VERSION}-Linux-x86_64.sh" | md5sum -c - && \
- /bin/bash Miniconda3-py38_${MINICONDA_VERSION}-Linux-x86_64.sh -f -b -p $CONDA_DIR && \
- rm Miniconda3-py38_${MINICONDA_VERSION}-Linux-x86_64.sh && \
+RUN export arch=$(uname -m) && \
+ if [ "$arch" == "aarm64" ]; then \
+ arch="arm64"; \
+ fi; \
+ wget --quiet https://github.com/conda-forge/miniforge/releases/download/4.8.5-1/Miniforge3-4.8.5-1-Linux-${arch}.sh -O miniforge.sh && \
+ chmod a+x miniforge.sh && \
+ ./miniforge.sh -f -b -p $CONDA_DIR && \
+ rm miniforge.sh && \
----
The docker-stacks notebooks are built with a Makefile, so to build the base image you'll execute `OWNER=holdenk make build/base-notebook`, setting "OWNER" to your Docker Hub username.
In addition to the docker-stacks images, you'll want to rebuild the zero-to-jupyterhub-k8s and configurable-http-proxy images.
With zero-to-jupyterhub-k8s you'll also need to change `images/singleuser-sample/Dockerfile` to use the docker-stacks image you built (e.g. in mine I replaced `FROM jupyter/base-notebook:45bfe5a474fa` with `FROM holdenk/base-notebook:latest`). The py-spy package will also need to be removed from the `images/hub/requirements.txt` file, since it is not cross-built (and it is optional anyway). zero-to-jupyterhub-k8s is built with `chartpress`, so you will just build with a custom image prefix as in <<ex_chartpress_build>>.
[[ex_chartpress_build]]
====
[source, bash]
----
chartpress --image-prefix holdenk/jupyter-hub-magic --force-build
----
====
With configurable-http-proxy no changes to the project itself are necessary; you can directly run `docker buildx build -t holdenk/jconfigurable-http-proxy:0.0.1 . --platform linux/arm64,linux/amd64 --push`. Note this is a different build command, as the proxy project does not use chartpress.
=== Configuring the container images
Now that you've built the container images, we need to configure our helm chart to use them. When you run `chartpress` inside of the zero-to-jupyterhub-k8s repo, it updates the JupyterHub values for the helm chart. You can either use this new chart by following the helm instructions on using a link:$$https://v2.helm.sh/docs/chart_repository/$$[chart repository] (e.g. `helm install $RELEASE ~/repos/scalingpythonml/scratch/zero-to-jupyterhub-k8s/jupyterhub --namespace $NAMESPACE --create-namespace --values config.yaml`) or you can use the existing published helm chart and override the images.
To install with the existing published chart you can run <<install_existing>>.
[[install_existing]]
====
[source, bash]
----
include::../subrepos/zero-to-jupyterhub-k8s/install.sh[]
----
====
To override the images, you'll need to specify them in your configuration when you're doing your install later, as in <<img-config>>. I put this in a file called `config.yaml`.
[[img-config]]
====
[source, yaml]
----
include::../subrepos/zero-to-jupyterhub-k8s/my-image-config.yaml[]
----
====
We'll keep building on the above configuration, since we need to do more than just override the images.
== Setting up a SSL Certificate
Jupyter expects an SSL certificate for its endpoint. If you don't have cert-manager installed, link:$$https://opensource.com/article/20/3/ssl-letsencrypt-k3s$$[this guide] shows how to configure SSL using Let's Encrypt. If you don't have a publicly accessible IP and domain, you'll need to use an alternative provider. Once you have cert-manager installed, it's time to request the certificate. The YAML for my certificate request for holdenkarau.mooo.com is shown in <<cert-req>>.
[[cert-req]]
====
[source, yaml]
----
include::../subrepos/scalingpythonml/certificate-stuff/le-prod-cert.yaml[]
----
====
Once you make your certificate request you can apply with `kubectl apply -f le-prod-cert.yaml`, and monitor it with `kubectl get certificates -n jhub -w -o yaml`. If your certificate does not become "Ready", you should check out the link:$$https://cert-manager.io/docs/faq/acme/$$[cert-manager debugging guide].
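For reference, a cert-manager `Certificate` request generally looks something like this sketch; the domain, secret name, and issuer name are placeholders for your own:

[source, yaml]
----
# Sketch: request a certificate for your JupyterHub domain.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: jhub-cert
  namespace: jhub
spec:
  secretName: jhub-tls          # the secret your ingress will reference
  dnsNames:
    - hub.example.com           # placeholder domain
  issuerRef:
    name: letsencrypt-prod      # placeholder issuer name
    kind: ClusterIssuer
----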
Now that you've got your SSL certificate stored as a secret in your cluster, you'll need to configure your JupyterHub ingress to use it by adding <<ingress-config>> to your `config.yaml`.
[[ingress-config]]
====
[source, yaml]
----
include::../subrepos/zero-to-jupyterhub-k8s/my-config-no-imgs.yaml[tags=ssl_cert_config]
----
====
== Making Jupyter work without a second public IP
Since my home only has one public IP address (with no spare to assign to Jupyter), I changed the service type from LoadBalancer to NodePort.
[source, yaml]
----
include::../subrepos/zero-to-jupyterhub-k8s/my-config-no-imgs.yaml[tags=proxy_service]
----
With this change the service deployed successfully and Traefik (installed in K3s by default) was able to route the requests.
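In zero-to-jupyterhub's `config.yaml`, that change amounts to something like the following sketch:

[source, yaml]
----
# Switch the proxy's public service from LoadBalancer to NodePort.
proxy:
  service:
    type: NodePort
----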
== Setting up Authentication
I was unable to get the JupyterHub GitHub plugin working, but it looks like there is
an link:$$https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/1871$$[outstanding issue to refactor the auth configuration,]
so for now I just hard-coded what is known as "dummy" authentication. I recommend switching to a real authentication provider as soon as the refactoring is complete.
[source, yaml]
----
include::../subrepos/zero-to-jupyterhub-k8s/my-config-no-imgs.yaml[tags=auth]
----
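For reference, in the 0.10 chart the "dummy" authentication section generally looks something like this sketch (the password is a placeholder, and dummy auth should not be exposed publicly):

[source, yaml]
----
auth:
  type: dummy
  dummy:
    password: a-shared-password  # placeholder
----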
== Installing JupyterHub with Helm
Now you can install this with Helm:
[source, bash]
----
include::../subrepos/zero-to-jupyterhub-k8s/helm-install.sh[]
----
And now you're ready to rock and roll with JupyterHub! However, part of the config is still commented out; that's because we have not yet built the single user Dask Jupyter Docker container. If you aren't using Dask, this can be your stopping point.
== Adding Dask Support
Adding Dask support involves configuring permissions so the notebook can create workers, building an image that works with your JupyterHub launcher, and adding the image as a profile to your launcher.
=== Permissions
The default service account probably won't have the right permissions to launch dask-distributed workers.
To start, create a specification for the permissions your notebook is going to need; I called mine `setup.yaml` (not very creative, I know), as in <<ex-perm-yaml>>.
[[ex-perm-yaml]]
====
[source, yaml]
----
include::../subrepos/scalingpythonml/dask-examples/setup.yaml[]
----
====
Now that you've specified the permissions you can go ahead and create the accounts, namespaces, and bindings to wire everything together as in <<ex-setup-namespace>>.
[[ex-setup-namespace]]
====
[source, bash]
----
include::../subrepos/scalingpythonml/dask-examples/setup.sh[tags=setup_namespace]
----
====
=== Building container images
The link:$$https://github.com/dask/dask-docker$$[dask-docker] project contains a notebook container file; however, it is not designed for use with JupyterHub's launcher. The first change needed is commenting out the auto-start in `notebook/prepare.sh`. The other required change is basing the Dockerfile on your cross-built singleuser-sample image. I updated mine to also install some helpful libraries, as in <<my_dask_nb_dockerfile>>:
[[my_dask_nb_dockerfile]]
====
[source, dockerfile]
----
include::../subrepos/dask-docker/notebook/Dockerfile[]
----
====
As before you can build this with the standard `docker buildx` commands.
=== Configuring your Jupyter
To enable Dask with Jupyter, you'll need to update your configuration to both specify the service account and add a profile for the notebook container you've created; the full configuration in <<my-total-config>> shows what this looks like in my situation.
Once you're done with your config changes, you need to update your install with `helm upgrade --cleanup-on-fail --install $RELEASE jupyterhub/jupyterhub --namespace $NAMESPACE --create-namespace --version=0.10.2 --values config.yaml`.
== Conclusion
All in all, my configuration file looks something like <<my-total-config>> (except with different secrets). Your config file will look a bit different depending on the choices you made while working through this post.
[[my-total-config]]
====
[source, yaml]
----
include::../subrepos/zero-to-jupyterhub-k8s/my-config-dask.yaml[]
----
====
Now you're ready to rock and roll with more stable Dask jobs that can survive when your notebook goes to sleep or your home internet connection goes out between you and your cluster. The next blog post will explore how this is a bit more involved when building a JupyterHub launcher container for a Spark notebook.

Some sharp corners with docker buildx (especially with qemu) (2020-12-11)
https://scalingpythonml.com/2020/12/11/some-sharp-corners-with-docker-buildx

Have you been trying out Docker's wonderful new buildx with QEMU, but are getting an unexpected "exec user process caused: exec format error" or strange segfaults on ARM? If so, this short and sweet blog post is for you. I want to be clear: I think buildx with qemu is amazing, but there are some sharp edges to keep your eyes out for.
## Cross building sharp edges
First, there are some issues when using cgo (and less often gcc) with QEMU which can sometimes cause segfaults. For me this showed up as "qemu: uncaught target signal 4 (Illegal instruction) - core dumped." Future versions of cgo, gcc or QEMU may work around these issues, but if you find yourself getting errors while building what seems like a trivial example, there's a good chance you've run into this. I've dealt with this problem by using an actual ARM machine for my cross-building.
The other sharp edge is that you can accidentally build a native architecture Docker image labeled as the cross-architecture image, and only find out at runtime. This can happen when the FROM label in your Dockerfile specifies a specific hash. In this case, the easiest thing to do is specify a version tag instead. While it won't fix the problem, using an actual target architecture machine for your building will let you catch this earlier on.
## Solution
Don't despair, though, instead of QEMU, we can use remote contexts. First, get a machine based on your target architecture. If you don't have one handy, some cloud providers offer a variety of architectures. Then, if your machine doesn't already have Docker on it, install Docker. Once you've set up docker on the remote machine, you can create a docker context for it. In my case, I have ssh access (with keys) as the root user to a jetson nano at 192.168.3.125, so I create my context as:
```bash
docker context create jetson-nano-ctx --docker host=ssh://root@192.168.3.125
```
Once you have a remote context, you can use it in a "build instance." If you have QEMU locally, as I do, it's important that the remote context is set to be used, since otherwise, we will still try to build with emulation.
```bash
docker buildx create --use --name mybuild-combined-builder jetson-nano-ctx
docker buildx create --append --name mybuild-combined-builder
```
Another random sharp edge that I've run into with Docker buildx is that a lot of transient issues seem to go away when I rerun the same command (e.g., "failed to solve: rpc error: code = Unknown desc = failed commit on ref"). I imagine this might be due to a race condition, because when I rerun it, Docker buildx uses caching -- but that's just a hunch.
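Since these transient errors usually clear on a rerun, one option is a small retry wrapper around the build command. This is just a sketch of that idea (the attempt count and sleep are arbitrary choices, and the command shown is a placeholder):

```shell
# Sketch: retry a flaky command a few times before giving up.
retry() {
  attempts=3
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "Attempt $i of $attempts failed: $*" >&2
    sleep 1
    i=$((i + 1))
  done
  return 1
}

# Example (placeholder command; swap in your real buildx invocation):
retry echo "docker buildx build . --platform linux/arm64,linux/amd64"
```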
## Conclusion
Another option, especially for Go, is to do your build on your source arch targeting your target arch. There is a Docker blog post [on that approach here.](https://www.docker.com/blog/containerize-your-go-developer-environment-part-1/) Cross-building C libraries is also an option, but more complicated.
Now you're ready to go off to the races and build with your remote machine. Don't worry, you can change your build instance back to your local context (use `docker buildx ls` to see your builders). Happy porting, everyone!
Have you run into additional sharp corners with QEMU & buildx? Let me know and I'll update this post :)

A First Look at Dask on ARM on K8s (2020-11-03)
https://scalingpythonml.com/2020/11/03/a-first-look-at-dask-on-arm-on-k8s

= A First Look at Dask on ARM on K8s
After getting the cluster set up in the previous post, it was time to finally play with Dask on the cluster. Thankfully, there are link:$$https://github.com/dask/dask-kubernetes$$[dask-kubernetes] and link:$$https://github.com/dask/dask-docker$$[dask-docker] projects that provide the framework to do this. Since I'm still new to Dask, I decided to start off by using Dask from a local notebook (in retrospect maybe not the best choice).
== Getting Dask on ARM in Docker
The dask-docker project gives us a good starting point for building a container for Dask, but the project's containers are only built for amd64. I started off by trying to rebuild the containers without any modifications, but it turned out there were a few issues that I needed to address. The first is that the regular conda docker image is also only built for amd64. Secondly, some of the packages that the Dask container uses are also not yet cross-built. While these problems will likely go away over the coming year, for the time being, I solved these issues by making a multi-platform condaforge docker container, asking folks to rebuild packages, and, when the packages did not get rebuilt, installing from source.
To do this, I created a new Dockerfile replacing the Miniconda base with Miniforge:
[source, dockerfile]
----
include::../subrepos/dask-docker/miniforge/Dockerfile[]
----
Most of the logic lives in this setup script:
[source, sh]
----
include::../subrepos/dask-docker/miniforge/setup.sh[]
----
I chose to install mamba, a fast C++ reimplementation of conda, and use it to install the rest of the packages. I did this since debugging package conflicts with the regular conda program was resulting in confusing error messages, and mamba's messages are clearer. I created a new version of the "base" Dockerfile from dask-docker, which installs the packages with mamba, and with pip when they're not available from conda.
[source, dockerfile]
----
include::../subrepos/dask-docker/base/Dockerfile[]
----
One interesting thing I noticed while exploring this is the link:$$https://github.com/dask/dask-docker/blob/master/base/prepare.sh$$[prepare.sh script] that is used as the entry point for the container. This script checks a few different environment variables that, when present, are used to install additional packages (Python or system) at container startup. While normally putting all of the packages into a container is best (since installations can be flaky and slow), this does allow for faster experimentation. At first glance, it seems like this still requires a Dask cluster restart to add a new package, but I'm going to do more exploring here.
== Getting Dask on Kube
With the containers built, the next step was trying to get them running on Kubernetes. I first tried the helm installation, but I wasn't super sure how to configure it to use my new custom containers and the documentation also contained warnings indicating that Dask with helm did not play well with dynamic scaling. Since I'm really interested in exploring how the different systems support dynamic scaling, I decided to install the dask-kubernetes project. With dask-kubernetes, I can create a cluster by running:
[source, python]
----
include::../subrepos/scalingpythonml/dask-examples/TestNB1.py[tags=create_in_default]
----
As I was setting this up, I realized it was creating resources in the default namespace, which made keeping track of everything difficult. So I created a namespace, service account, and role binding so that I could better keep track of (and clean up) everything:
[source, python]
----
include::../subrepos/scalingpythonml/dask-examples/setup.sh[tags=setup_namespace]
----
To use this, I added another parameter to the cluster creation and updated the YAML:
[source, python]
----
include::../subrepos/scalingpythonml/dask-examples/TestNB1.py[tags=create_in_namespace]
----
Using `from_yaml` is important, as it lets me specify particular containers and resource requests (which will be useful when working with GPUs). I modified the standard worker-spec to use the namespace and service account I created.
[source, yaml]
----
include::../subrepos/scalingpythonml/dask-examples/worker-spec.yaml[]
----
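For reference, a dask-kubernetes worker spec along these lines generally looks something like the following sketch; the image, resource numbers, and worker arguments are placeholders rather than my actual values:

[source, yaml]
----
# Sketch: a worker pod spec using the custom namespace and service account.
kind: Pod
metadata:
  namespace: dask
spec:
  serviceAccountName: dask
  restartPolicy: Never
  containers:
    - name: dask-worker
      image: example/dask:latest  # placeholder: your cross-built image
      args:
        - dask-worker
        - --nthreads
        - "2"
        - --memory-limit
        - 4GB
        - --death-timeout
        - "60"
      resources:
        limits:
          cpu: "2"
          memory: 4G
----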
While this would work if I was _inside_ the Kubernetes cluster I wanted to start with an experimental notebook outside the cluster. This required some changes, and in retrospect is not where I should have started.
== Dask in Kube with Notebook access
There are two primary considerations when setting up Dask for notebook access on Kube. The first is where you want your notebook to run, inside the Kubernetes cluster or outside (e.g. on your machine). The second consideration is if you want the Dask scheduler to run alongside your notebook, or in a separate container inside of Kube.
The first configuration I tried was having a notebook on my local machine. At first, I could not get it working because the scheduler was running on my local machine and could not talk to the worker pods it spun up. That's why, unless you're using host networking, I recommend having the scheduler run inside the cluster. Doing this involves adding a "deploy_mode" keyword to your KubeCluster invocation and asking Dask to create a service for your notebook to talk to the scheduler.
[source, python]
----
include::../subrepos/scalingpythonml/dask-examples/TestNB2.py[tags=remote_lb_deploy]
----
Running your notebook on a local machine _may_ make getting started faster, but it comes with some downsides. It's important to keep your client's Python environment in sync with the worker/base containers. For me, setting up my conda env meant running:
[source, python]
----
include::../subrepos/scalingpythonml/dask-examples/setup.sh[tags=setup_conda]
----
Another big issue you'll likely run into is that transient network errors between your notebook and the cluster can result in non-recoverable errors. This has happened to me even with networking all inside my house, so I can imagine that it would be even more common with a VPN or a cloud provider network involved.
The final challenge that I ran into was with I/O. Some code will run in the workers and some will run on the client, and if your workers and client have a different network view or if there are resources that are only available inside the cluster (for me MinIO), the error messages can be confusing footnote:[I worked around this by setting up port-forwarding so that the network environment was the same between my local machine and the cluster. You could also expose the internal-only resources through a service and have internal & external access through the service, but I just wanted a quick stop-gap. This challenge convinced me I should re-run with my notebook inside the cluster.].
Note: you don't have to use Dask with Kubernetes, or even a cluster. If you don't have a cluster, or have a problem where a cluster might not be the best solution, Dask also supports other execution environments like multithreading and GPU acceleration. I'm personally excited to see how the GPU acceleration can be used together with Kubernetes.
== The different APIs
Dask exposes a few different APIs for distributed programming at different levels of abstraction. Dask's "core" building block is the delayed API, on top of which collections and DataFrame support is built. The delayed API is notably a lower level API than Spark's low level public APIs -- and I'm super interested to see what kind of things it enables us to do.
Dask has three different types of distributed collection APIs: Bag, DataFrame, and Array. These distributed collections map relatively nicely to common Python concepts, and the DataFrame API is especially familiar.
Almost footnote:[You can use the actor API within the other APIs, but it is not part of the same building blocks.] separate from the delayed and collections APIs, Dask also has an (experimental) Actor API. I'm curious to see how this API continues to be developed and used. I'd really like to see if I can use it as a parameter server.
To verify my cluster was properly set up I did a quick run through the tutorials for the different APIs.
== Next Steps
Now that I've got Dask on Kube running on my cluster I want to do some cleanup and then explore more about how Dask handles dataframes, partitioning/distributing data/tasks, auto scaling, and GPU acceleration. If you've got any suggestions for things you'd like me to try out, do please get in touch :)

Setting up K3s (lightweight Kubernetes) with Persistent Volumes and Minio on ARM (2020-10-18)
https://scalingpythonml.com/2020/10/18/setting-up-k3s-with-pvs-and-minio-on-arm

After the [last adventure](http://scalingpythonml.com/2020/09/20/building-the-physical-cluster.html) of getting the rack built and acquiring the machines, it was time to set up the software. Originally, I had planned to do this in a day or two, but in practice, it ran like so many other "simple" projects and some things I had assumed would be "super quick" ended up taking much longer than planned.
Software-wise, I ended up deciding on using [K3s](https://k3s.io/) for the Kubernetes deployment, and [Rook](https://rook.io/) with Ceph for the persistent volumes. And while I don't travel nearly as much as I used to, I also set up [tailscale for VPN access](https://tailscale.com/) from the exciting distant location of my girlfriend's house (and in case we ended up having to leave due to air quality).
## Building the base image for the Raspberry Pis
For the Raspberry Pis I decided to use the Ubuntu Raspberry Pi image as its base. The Raspberry Pis boot off of microsd cards, which allows us to pre-build system images rather than running through the install process on each instance. My desktop is an x86, but by following [this guide](https://docs.j7k6.org/raspberry-pi-chroot-armv7-qemu/), I was able to set up an emulation layer so I could cross-build the image for the ARM Raspberry Pis.
I pre-installed the base layer with Avahi (so the workers can find the leader), ZFS (to create a local storage layer to back our volumes), and the necessary container tools. This step ended up taking a while, but I made the most of it by re-using the same image on multiple workers. I also had this stage copy over some configuration files, which didn't depend on having emulation set up.
However, not everything is easily baked into an image. For example, at first boot, the leader node installs K3s and generates a certificate. Also, when each worker first boots, it connects to the leader and fetches the configuration required to join the cluster. Ubuntu has a mechanism for this (called cloud-init), but rather than figure out a new system I went with the old school self-disabling init-script to do the "first boot" activities.
## Setting up the Jetsons & my one x86 machine
Unlike the Raspberry Pis, the Jetson AGXs and the x86 machine have internal storage that they boot from. While the Jetson Nano does boot from a microsd card, the images available are installer images that require user interaction to set up. Thankfully, since I wrote everything down in a shell script, it was fairly simple to install the same packages and do the same setup as on the Raspberry Pis.
By default, K3s uses containerd to execute its containers. I found another interesting [blog post on using K3s on Jetsons](https://www.virtualthoughts.co.uk/2020/03/24/k3s-and-nvidia-jetson-nano/), and the main changes that I needed for the setup is to switch from containerd to docker and to configure docker to use the "nvidia" runtime as the default.
## Getting the cluster to work
So, despite pre-baking the images, and having scripts to install "everything," I ended up running into a bunch of random problems along the way. These spanned everything from hardware to networking to my software setup.
The leader node started up about as close to perfect as possible, and one of the two worker Raspberry Pis came right up. The second worker Pi kept spitting out malformed packets on the switch -- I'm not really sure what's going on with that one -- but the case did melt a little bit, which makes me think there might have been a hardware issue with that node. I did try replacing the network cable and putting it into a different port, but got the same results. When I replaced it with a different Pi, everything worked just fine, so I'll debug the broken node when I've got some spare time.
I also had some difficulty with my Jetson Nano not booting. At first, I thought maybe the images I was baking were no good, but then I tried the stock image along with a system reset, and that didn't get me any further. Eventually I tried a new microsd card along with the stock image and shorting out pin 40 and it booted like a champ.
On the networking side, I have fail-over configured for my home network. However, despite thinking I had configured my router to fail over only when the primary connection has an outage, and not to do any load balancing otherwise, I kept getting random connection issues. Once I disabled the fail-over connection, the networking issues disappeared. I'm not completely sure what's going on with this part, but for now, I can just manually do a fail-over if Sonic goes out.
On the software side, Avahi worked fine on all of the ARM boards but for some reason doesn't seem to be working on the x86 node. The only difference I could figure out was that the x86 node has a static lease configured with the DHCP server, but I don't think that would cause this issue. While having local DNS between the worker nodes would be useful, this was getting near the end of the day, so I just added the leader to the x86 node's hosts file and called it a day. The software issues lead nicely into the self-caused issues I had trying to get persistent volumes working.
## Getting persistent volumes working
One of the concepts I'm interested in playing with is fault tolerance. One potential mechanism for this is using persistent volumes to store some kind of state and recovering from them. In this situation we want our volumes to remain working even if we take a node out of service, so we can't just depend on local volume path provisioning to test this out.
There are many different projects that can provide persistent volumes on Kubernetes. My first attempt was with GlusterFS; however, the Gluster Kubernetes project has been "archived." So after some headaches, I moved on to trying Rook and Ceph. Getting Rook and Ceph running together ended up being quite the learning adventure; both Kris and Duffy jumped on a video call with me to help figure out what was going on. After a lot of debugging, they noticed that it was an architecture issue -- namely, many of the CSI containers were not yet cross-compiled for ARM. We did a lot of sleuthing and found unofficial multi-arch versions of these containers. Since then, the [raspbernetes](https://github.com/raspbernetes/multi-arch-images) project has started cross-compiling the CSI containers, which I've switched to using, as it's a bit simpler to keep track of.
![Image of rook/ceph status reporting ok](/images/rook-ceph-works.jpeg)
<!-- From setup_rook.sh -->
```bash
pushd /rook/cluster/examples/kubernetes/ceph
kubectl create -f common.yaml
kubectl create -f rook_operator_arm64.yaml
kubectl create -f rook_cluster.yaml
kubectl create -f ./csi/rbd/storageclass.yaml
```
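Once the storage class exists, claiming a volume from it is the standard Kubernetes PVC shape. The claim name and size below are illustrative; `rook-ceph-block` is the storage class created by the `storageclass.yaml` above:

```
# pvc.yaml -- illustrative claim against the Rook/Ceph storage class
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-claim
spec:
  storageClassName: rook-ceph-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```

Applying it with `kubectl apply -f pvc.yaml` and watching `kubectl get pvc` go from `Pending` to `Bound` is a quick smoke test that the whole Rook/Ceph/CSI stack is actually working.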
## Adding an object store
During my first run of [Apache Spark on the new cluster](https://www.youtube.com/watch?v=V1SkEl1r4Pg&t=6s), I was reminded of the usefulness of an object-store. I'm used to working in an environment where I have an object store available. Thankfully MinIO is available to provide an S3 compatible object store on Kube. It can be backed by the persistent volumes I set up using Rook & Ceph. It can also use local storage, but I decided to use it as a first test of the persistent volumes. Once I had fixed the issues with Ceph, MinIO deployed relatively simply [using a helm chart](https://github.com/minio/charts).
While MinIO does build docker containers for arm64 and amd64, it gives them separate tags. Since I've got a mix of x86 and ARM machines in the same cluster, I ended up using an unofficial multi-arch build. I did end up pinning it to the x86 machine for now, since I haven't had the time to recompile the kernels on the ARM machines to support rbd.
<!-- From setup_minio.sh -->
```bash
# Install minio using ceph to back our storage. Deploy on the x86 because we don't have the rbd kernel module on the ARM nodes. Also we want to save the arm nodes for compute.
helm install --namespace minio --generate-name minio/minio --set persistence.storageClass=rook-ceph-block,nodeSelector."beta\\.kubernetes\\.io/arch"=amd64
# Do a helm ls and find the deployment name
deployment_name=$(helm ls -n minio | cut -f 1 | tail -n 1)
ACCESS_KEY=$(kubectl get secret -n minio "$deployment_name" -o jsonpath="{.data.accesskey}" | base64 --decode); SECRET_KEY=$(kubectl get secret -n minio "$deployment_name" -o jsonpath="{.data.secretkey}" | base64 --decode)
# Defaults are "YOURACCESSKEY" and "YOURSECRETKEY"
mc alias set "${deployment_name}-local" http://localhost:9000 "$ACCESS_KEY" "$SECRET_KEY" --api s3v4
mc ls "${deployment_name}-local"
mc mb "${deployment_name}-local/dask-test"
```
## Getting kubectl working from my desktop
Once I had K3s set up, I wanted to be able to access it from my desktop without having to SSH to a node in the cluster. The [K3s documentation says](https://rancher.com/docs/k3s/latest/en/cluster-access/) to copy `/etc/rancher/k3s/k3s.yaml` from the cluster to your local `~/.kube/config` and replace the string localhost with the IP/DNS name of the leader. Since I had multiple existing clusters, I copied the part under each top-level key to the corresponding key, changing the "default" string to k3s as I went so that I could remember the context better. The first time I did this I got the whitespace mixed up, which led to `Error in configuration: context was not found for specified context: k3s` -- but after I fixed my YAML everything worked :)
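For reference, the merged `~/.kube/config` ends up with roughly this shape (the server address and certificate fields are placeholders; the `k3s` names are what the `default` entries from `k3s.yaml` become after renaming):

```
apiVersion: v1
kind: Config
clusters:
- name: k3s
  cluster:
    server: https://leader.example:6443   # was https://localhost:6443
    certificate-authority-data: <from k3s.yaml>
# ... plus the cluster entries from your existing clusters ...
contexts:
- name: k3s
  context:
    cluster: k3s
    user: k3s
users:
- name: k3s
  user:
    client-certificate-data: <from k3s.yaml>
    client-key-data: <from k3s.yaml>
current-context: k3s
```

Each entry under `clusters`, `contexts`, and `users` is a YAML list item, which is where the indentation mistakes bite: if the copied block isn't indented as a sibling list item, kubectl simply won't find the `k3s` context.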
## Setting up a VPN solution
While shelter-in-place has made accessing my home network remotely less important, I do still occasionally get out of the house while staying within my social bubble. Some of my friends from university/co-op are now at a company called Tailscale, which does magic with WireGuard to allow even double-NATed networks to have VPNs. Since I was doing this part as an afterthought, I didn't have Tailscale installed on all of the nodes, so I followed the [instructions to enable subnets](https://tailscale.com/kb/1019/subnets) (note: I missed enabling "Enable subnet routes" in the admin console the first time) and have my desktop act as a "gateway" host for the K8s cluster when I'm "traveling." With Tailscale set up, I was able to run kubectl from my laptop at Nova's place :)
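The subnet-router setup from those instructions boils down to two configuration commands on the gateway machine; the subnet below is a placeholder for whatever your LAN actually uses:

```
# On the gateway (desktop): let it forward packets for the subnet
sudo sysctl -w net.ipv4.ip_forward=1
# Advertise the home LAN to the tailnet (placeholder subnet)
sudo tailscale up --advertise-routes=192.168.1.0/24
# The advertised route still has to be approved ("Enable subnet routes")
# in the Tailscale admin console -- the step I missed the first time.
```

To make forwarding survive reboots, the sysctl setting also needs to go into `/etc/sysctl.conf` (or a drop-in under `/etc/sysctl.d/`).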
Josh Patterson has [a blog post on using tailscale with RAPIDS](https://medium.com/rapids-ai/rapids-anywhere-with-tailscale-my-mobile-device-has-an-rtx-3090-1ce0c7b443fe?source=rss----2d7ba3077a44---4).
## Conclusion & alternatives
The setup process was a bit more painful than I expected, but that was mostly due to my own choices. In retrospect, building images and flashing them was relatively slow with the emulation required on my old desktop. It would have been much easier to do a non-distributed volume deployment, like local volumes, but I wanted to set up PVs that I can experiment with using for fault recovery. Nova pointed out that I could have set up sshfs or NFS and could have gotten PVs working with a lot less effort, but by the time we had that conversation, the sunk-cost fallacy had me believing just one more "quick fix" was needed and then it would all magically work. Instead of K3s, I could have used kubeadm, but that seemed relatively heavyweight. Instead of installing K3s "manually," the [k3sup project](https://ma.ttias.be/deploying-highly-available-k3s-k3sup/) could have simplified some of this work. However, since I have a mix of different types of nodes, I wanted a bit more control.
Now that the cluster is set up, I'm going to test it out some more with Apache Spark, the distributed computing framework I'm most familiar with. Once we've made sure the basics are working with Spark, I'm planning on exploring how to get Dask running. You can follow along with my adventures on my [YouTube channel over here](https://www.youtube.com/user/holdenkarau), or [subscribe to the mailing list](/mailinglist.html) to keep up to date when I write a new post.

# Building the Test Cluster (2020-09-20)

To ensure that the results between tests are as comparable as possible, I'm using a consistent hardware setup whenever possible. Rather than use a cloud provider, I (with the help of Nova) set up a rack with a few different nodes. Using my own hardware allows me to avoid the [noisy neighbor problem](https://en.wikipedia.org/wiki/Cloud_computing_issues#Performance_interference_and_noisy_neighbors) with any performance numbers and gives me more control over simulating network partitions. A downside is that the environment is not as easily re-creatable.
## Building the Rack
If I'm honest, a large part of my wanting to do this project is that ever since I was a small kid, I've always dreamed of running "proper" networking gear (expired CCNA club represent). I got a [rack](https://amzn.to/32OCQEq) and some shelves. (I also got an avocado tree to put on top and a [cute kubecuddle sticker](https://www.etsy.com/listing/787021025/kubectl-corgi-kubernetes-sticker?ga_order=most_relevant&ga_search_type=all&ga_view_type=gallery&ga_search_query=kubernetes&ref=sr_gallery-1-2&organic_search_click=1&col=1) for good luck.)
![Image of my rack with avocado tree on top](/images/rack.jpg)
It turns out that putting together a rack is not nearly as much like LEGO as I had imagined. Some of the shelves I got ended up being very heavy (and some did not fit), but thankfully Nova came to the rescue when things got too heavy for me to move.
After running the rack for about a day, I got a complaint from my neighbor about how loud the fan was, so I swapped it out for some [quieter fans](https://amzn.to/32NpeJN).
## The Hosts
The hosts themselves are a mixture of machines. I picked up three [Raspberry Pi 4Bs](https://www.raspberrypi.org/products/raspberry-pi-4-model-b/). I'm also running a [Jetson Nano](https://amzn.to/3kBFG6c) and three [Jetson AGX Xaviers](https://amzn.to/3jzO58O) to let me experiment with GPU acceleration. To support any x86-only code, I also have a small refurbished x86 machine present.
For storage, I scrounged up some of the free flash drives I've gotten from conferences over the years. This initial setup was not very fast, so I added some inexpensive on-sale external SSD drives.
## Setting up Kubernetes
Since I want to be able to swap between the different Python scaling tools easily, I chose Kubernetes as the base cluster layer rather than installing directly on the nodes. Since it is easy to deploy, I used K3s as the cluster manager. The biggest pain here was figuring out why the storage provisioning I was trying to use wasn't working, but thankfully Duffy came to the rescue, and we figured it out.
## What's next?
Up next, I'll start exploring how the different tools work in this environment. At the very start, I'll just run through each tool's tutorials and simulate some network and node failures to see how resilient they are. Once I've got a better handle on how each tool works, I'm planning on exploring how each of them approaches the problem of scaling pandas operations. Once that's done, we can start to get in a lot deeper and see where each tool shines. If you are interested in following along, check out my [YouTube channel on open source programming](https://www.youtube.com/user/holdenkarau), where I will try and stream the process that goes into each post. You can also [subscribe to the mailing list for notifications on this and my books](https://www.introductiontomlwithkubeflow.com/?from=introductiontomlwithkubeflow.com) when I get something working well enough to make a new post :)
### Disclaimer
This blog does not represent any of my employers, past or present, and does not represent any of the software projects or foundations I'm involved with. I am one of the developers of Apache Spark and have [some books published on the topic](https://amzn.to/2O6KYYH) that may influence my views, but my views do not represent the project.
In as much as possible, I've used a common cluster environment for testing these different tools, although some parts have been easier to test out on Minikube.