Hello and welcome to my adventures in Scaling Python Machine Learning (ML).
Scaling Python ML
JupyterHub has long been the de facto leader in the notebook space. Polynote is a new approach to notebooks with multi-language kernels and a focus on first-class Scala support. There is a lot of tooling to make deploying and managing JupyterHub easy in various environments. Since Polynote is relatively new, the same tooling does not yet exist. Since I am lazy, and I wanted a great notebook for Spark (with Scala and Python), I figured I’d see if I could get my ZeroToJupyterHub deployment to launch a Polynote notebook. This way I can have multi-user Kubernetes deployments of Polynote alongside the other kernels that I’m using (Python w/Dask, Python w/Ray, etc.). It turns out that yes, it is possible, and of course I learned some things along the way. Despite my initial thought that this would "take an hour or so," it ended up being an 8-part stream, and I still don’t have security set up.
While we’ve already set up JupyterHub using zero-to-jupyterhub on Kubernetes, I wanted to expand access to my other prospective co-authors as part of trying to convince them to write with me (again) :). While I generally trust my co-authors, I still like having some kind of access controls, so we don’t accidentally stomp on each other’s work.
Now that we’ve got Dask installed, it’s time to try some simple data preparation and extract, transform, load (ETL). While ETL is often not the most exciting thing, getting data is the first step of most adventures. Data tools don’t exist in a vacuum; the data normally comes from somewhere else, and the data or models we make need to be usable with other tools. Because of this, the formats and systems that a tool can interact with can make the difference between it being a good fit and you needing to keep looking. To simplify your life with I/O, you should make sure your notebook (or client) runs inside the same cluster as the workers.
We previously set up the cluster with GPUs, and since we want to get into using GPU acceleration, it’s important to be able to request these resources from the Kubernetes scheduler. Tagging the nodes will help Kubernetes tell the difference between the Raspberry Pi and Jetson Xavier in [my-cluster-img]. Depending on your type of GPUs, there are options for automatically labeling your nodes. Since I’ve got NVIDIA boards, we’ll use the k8s-device-plugin to do the labeling of the nodes. For our machines, though, the containers do not run on ARM out of the box, and the code does not detect Jetson chips.
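As a rough illustration of what requesting a GPU from the scheduler can look like, here is a minimal sketch that builds a Pod manifest as a Python dictionary. The `nvidia.com/gpu` resource name is the one the NVIDIA device plugin advertises; the `gpu_pod_spec` helper, the `example.com/gpu-board` label, and the image name are hypothetical and only here for illustration, so substitute whatever labels your nodes actually carry.

```python
import json

# Hypothetical node label (an assumption); use the label your GPU nodes really have.
GPU_NODE_LABEL = {"example.com/gpu-board": "jetson-xavier"}


def gpu_pod_spec(name, image, gpus=1, node_selector=None):
    """Build a Pod manifest (as a dict) that asks the scheduler for NVIDIA GPUs."""
    if gpus < 1:
        raise ValueError("request at least one GPU")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Only schedule onto nodes carrying the matching label.
            "nodeSelector": dict(node_selector or GPU_NODE_LABEL),
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources such as GPUs are requested under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # "example/image:latest" is a placeholder image name.
    print(json.dumps(gpu_pod_spec("gpu-test", "example/image:latest"), indent=2))
```

Since kubectl accepts JSON manifests as well as YAML, the printed output could be saved to a file and applied directly. Whether a pod like this gets scheduled still depends on the device plugin running on the node and the label being present.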
Having your Spark notebook inside the same cluster as the executors can reduce network errors and improve uptime. Since these network issues can result in job failure, this is an important consideration. This post assumes that you’ve already set up the foundational JupyterHub deployment inside of Kubernetes; the Dask-distributed notebook blog post covers that if you haven’t.
In this post, we are going to go through how to deploy Jupyter Lab on ARM on Kubernetes. We’ll also build a container for use with Dask, but you can skip/customize this step to meet your own needs. In the previous post, I got Dask on ARM on Kubernetes working, while using remote access to allow the Jupyter notebook to run outside of the cluster. After running into a few issues from having the client code outside of the cluster, I decided it was worth the effort to set up Jupyter on ARM on K8s.
Have you been trying out Docker’s wonderful new buildx with QEMU, but are getting an unexpected “exec user process caused: exec format error” or strange segfaults on ARM? If so, this short and sweet blog post is for you. I want to be clear: I think buildx with QEMU is amazing, but there are some sharp edges to keep an eye out for.
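One common cause of an “exec format error” is a binary built for one CPU architecture being run on another without a working emulator, though that may or may not be what is happening in your case. As a small sketch for checking this, the snippet below maps the host’s machine type to a Docker platform string and reports whether a target platform would need emulation. The helper names and the mapping table are my own assumptions for illustration and are not part of Docker or buildx.

```python
import platform

# Common machine-type names mapped to Docker platform strings.
# This table is an illustrative assumption and is not exhaustive.
_MACHINE_TO_PLATFORM = {
    "x86_64": "linux/amd64",
    "amd64": "linux/amd64",
    "aarch64": "linux/arm64",
    "arm64": "linux/arm64",
    "armv7l": "linux/arm/v7",
}


def docker_platform(machine=None):
    """Return the Docker platform string for a machine type (default: this host)."""
    machine = (machine or platform.machine()).lower()
    try:
        return _MACHINE_TO_PLATFORM[machine]
    except KeyError:
        raise ValueError(f"unrecognized machine type: {machine!r}") from None


def needs_emulation(target, machine=None):
    """True when the target platform string differs from the host's platform."""
    return docker_platform(machine) != target


if __name__ == "__main__":
    host = docker_platform()
    print(f"host platform: {host}")
    print(f"linux/arm64 needs emulation: {needs_emulation('linux/arm64')}")
```

This is a simple string comparison, so it treats every mismatch as needing emulation, even in cases where the hardware can run the other architecture natively (some 64-bit ARM chips can run 32-bit ARM binaries, for example).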
After getting the cluster set up in the previous post, it was time to finally play with Dask on the cluster. Thankfully, there are dask-kubernetes and dask-docker projects that provide the framework to do this. Since I’m still new to Dask, I decided to start off by using Dask from a local notebook (in retrospect maybe not the best choice).
After the last adventure of getting the rack built and acquiring the machines, it was time to set up the software. Originally, I had planned to do this in a day or two, but in practice, it went like so many other “simple” projects, and some things I had assumed would be “super quick” ended up taking much longer than planned.
To ensure that the results between tests are as comparable as possible, I’m using a consistent hardware setup whenever possible. Rather than use a cloud provider, I (with the help of Nova) set up a rack with a few different nodes. Using my own hardware allows me to avoid the noisy neighbor problem with any performance numbers and gives me more control over simulating network partitions. A downside is that the environment is not as easily re-creatable.
After my motorcycle/Vespa crash last year, I took some time away from work. While I was out and trying to practice getting my typing speed back up, I decided to play with Ray, which was pretty cool. Ray comes out of the same[1] research lab that created the initial work that became the basis of Apache Spark. Like Spark, the primary authors have now started a company (Anyscale) to grow Ray. Unlike Spark, Ray is a Python-first library and does not depend on the Java Virtual Machine (JVM). As someone who’s spent way more time than she would like getting the JVM and Python to play together, I think Ray and its cohort seem quite promising.
[1] Well… same-ish. It’s technically a bit more complicated because of the way the professors choose to run their labs, but if you look at the advisors you’ll notice a lot of overlap.