Supercharge Your Shallow ML Models With Hummingbird

Jul 7, 2020

#software
#machine-learning

Read time: 44 minutes

Photo by Levi Jones on Unsplash (Source)

Motivation

Since the most recent resurgence of deep learning in 2012, a flood of new ML libraries and frameworks has been created. The ones that have stood the test of time (PyTorch, TensorFlow, ONNX, etc.) are backed by massive corporations, and likely aren't going away anytime soon.

This also presents a problem, however, as the deep learning community has diverged from popular traditional ML software libraries like scikit-learn, XGBoost, and LightGBM. When it comes time for companies to bring multiple models with different software and hardware assumptions into production, things get...hairy.

How do you keep ML inference code DRY when some models are tensor-based and others are vector-based?
How do you keep the inference runtime efficiency of your traditional models competitive, as GPU-based neural networks start to run circles around them?

In Search of A Uniform Model Serving Interface

I know, I know. Using microservices in Kubernetes can solve the design pattern issue to an extent by keeping things de-coupled...if that's even what you want?

But, I think that really just ignores the problem. What if you want to seamlessly deploy either an XGBoost regressor or fully-connected DNN as your service's main output? Sure, you could hot-swap the hardware your service launches onto. How about the code?

Are you going to ram in a dressed-up version of an if-else switch to use one software framework vs the other, depending on the model type?

Isn't XGBoost/LightGBM Fast Enough?

Well, for a lot of use cases, it is. However, there's still a huge gap between problems requiring neural nets and problems that can be sufficiently solved with more traditional models. For the more traditional models, don't you still want to be able to use the latest and greatest computational frameworks to power your model's predictions? This would allow you to scale your model up more before you need to resort to scaling it out via redundant instances.

Enter Hummingbird

Microsoft Research has introduced Hummingbird to bridge this gap between CPU-oriented models and tensor-oriented models. The library simply takes one of our already-trained traditional models and returns a version of that model built on tensor computations. Hummingbird aims to solve two core concerns with current ML applications:

Traditional and deep learning software libraries have different abstractions of their basic computational unit (vector vs tensor).
As a result of this difference, traditional ML libraries do not receive the same performance gains as hardware accelerators (read: GPUs) improve.

With Hummingbird, your ML pipelines will start to look cleaner. You'll know that, regardless of the algorithm, you end up with a model that creates its predictions via tensor computations. Not only that, these tensor computations will be run by the same deep learning framework that your organization has likely already pledged allegiance to.

All of this from one function call. Not a bad deal in my book!

Let's see it in action.

Setup

Let's get this part out of the way. You know the drill.

Ensure Reproducibility

Let's control randomness in NumPy and PyTorch with the answer to everything in the universe.
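
The notebook cell for this step isn't reproduced here, so below is a minimal sketch of what the seeding could look like. The `set_seed` helper and the `SEED` constant are illustrative names, not part of any library, and PyTorch is imported lazily so the sketch still runs where it isn't installed.

```python
import numpy as np

SEED = 42  # "the answer to everything"; any fixed integer would do


def set_seed(seed: int = SEED) -> None:
    """Seed NumPy's global generator and, if installed, PyTorch's."""
    np.random.seed(seed)
    try:
        import torch  # lazy import: assumption that PyTorch may be absent
    except ImportError:
        return
    torch.manual_seed(seed)


set_seed()
```

Calling `set_seed()` again before a step that draws random numbers should reproduce the same draws.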

Conjure Some Data

Let's define a quick helper to make some classification data, and create some data sets with it.
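
The helper itself isn't shown in this version of the article, so treat the following as a sketch under assumptions. It leans on scikit-learn's `make_classification` and `train_test_split`; the name `make_data`, the sample and feature counts, the 80/20 split, and the float32 cast are illustrative choices, not the notebook's actual settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split


def make_data(n_samples: int = 10_000, n_features: int = 20, seed: int = 42):
    """Return X_train, X_test, y_train, y_test for a synthetic binary problem."""
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        random_state=seed,
    )
    # PyTorch defaults to float32, so cast once up front (an assumption here).
    X = X.astype(np.float32)
    return train_test_split(X, y, test_size=0.2, random_state=seed)


X_train, X_test, y_train, y_test = make_data()
```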

Thanks to Deepnote, I don't have to take up half of this notebook just printing out data shapes and distributions for you. These notebooks come with a feature-packed variable explorer which provides most of the basic EDA-style questions you will have about your data.

Check out the original version of this article to see how Deepnote automatically creates beautiful articles from their Jupyter-style notebooks.

Bring in the Bird

Now, the one-liner to get your model transformed into tensor computations via Hummingbird is convert(clf, 'pytorch').

That's it. That's Hummingbird.

Just to make our comparisons even easier, let's write a quick function on top of that to automatically move the model to a GPU when one is available. As some final added sugar, we'll take in a flag that forces keeping the model on CPU, should the need arise. Keep in mind, though, that the single call to convert() is the only interface you need to have with Hummingbird; it does all of its magic under the hood of that single function call. Once you get the model returned to you, you call predict() on it like any other traditional Pythonic ML model.
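
A wrapper along those lines could be sketched as follows. The names `pick_device` and `to_hummingbird` are hypothetical, as is the `force_cpu` flag name; the only Hummingbird calls used are `convert(clf, "pytorch")` and the returned model's `to("cuda")`. The heavy imports live inside the function so the block can be defined even where PyTorch or Hummingbird (installed as `hummingbird-ml`) is missing.

```python
def pick_device(force_cpu: bool, cuda_available: bool) -> str:
    """Choose 'cuda' only when a GPU exists and the caller allows it."""
    return "cuda" if cuda_available and not force_cpu else "cpu"


def to_hummingbird(clf, force_cpu: bool = False):
    """Convert a trained traditional model to a PyTorch-backed one.

    Hypothetical helper: `clf` is assumed to be an already-fitted model
    of a type that Hummingbird supports.
    """
    import torch
    from hummingbird.ml import convert

    model = convert(clf, "pytorch")
    if pick_device(force_cpu, torch.cuda.is_available()) == "cuda":
        model.to("cuda")
    return model
```

With a fitted classifier in hand, `to_hummingbird(clf).predict(X_test)` would then be used exactly like `clf.predict(X_test)`, and `to_hummingbird(clf, force_cpu=True)` gives the CPU variant for comparison.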

Get Your Watches Out, It's Time to Time

Alright, it's time to benchmark! Don't worry about a flurry of imports or method wrappers. Let's keep it simple here with the %%timeit magic command. This magic command will automatically run the code in the cell multiple times, reporting out the mean and standard deviation of the runtime samples. First, we'll time the sklearn model as-is. Then, we'll see how the Hummingbird on-CPU and on-GPU models compare.
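
Outside of a notebook, roughly the same measurement can be sketched with the standard library's `timeit` module. `time_call` is a hypothetical helper, and the `repeat` and `number` values are arbitrary; the magic command picks its own loop count, so the numbers will not match it exactly.

```python
import statistics
import timeit


def time_call(fn, repeat: int = 7, number: int = 10):
    """Return (mean, std) of the per-loop runtime of fn, in seconds."""
    # Each entry is the total time for `number` back-to-back calls.
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    per_loop = [total / number for total in totals]
    return statistics.mean(per_loop), statistics.pstdev(per_loop)


mean_s, std_s = time_call(lambda: sum(range(10_000)))
print(f"{mean_s * 1e3:.3f} ms ± {std_s * 1e3:.3f} ms per loop")
```

Swapping the lambda for something like `lambda: clf.predict(X_test)` would time a model instead of the toy workload.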

Original: 136 ms ± 1.59 ms per loop
Hummingbird CPU: 1.81 s ± 16.1 ms per loop

Yikes!

Well...that was unexpected. There are no two ways about this one: Hummingbird might run slower on CPU for certain data sets. This can even be seen in some of the current example notebooks in the Hummingbird GitHub repo. I say "certain data sets" intentionally, as it does outperform the original on others.

That being said, this side effect shouldn't have anyone running for the door --- remember the library's goal! The main reason for converting a model to tensor computations is to leverage hardware that excels in that area.

Spoiler alert: I'm talking about GPUs! This Deepnote notebook comes powered by an NVIDIA T4 Tensor Core GPU. Let's see how the model runs on that.

Original: 136 ms ± 1.59 ms per loop
Hummingbird CPU: 1.81 s ± 16.1 ms per loop
Hummingbird GPU: 36.6 ms ± 65.8 µs per loop

There we go! The GPU run cuts the mean runtime by 73% compared to the original (roughly a 3.7x speedup), and the spread is much tighter too. The original's standard deviation is about 1.2% of its mean, while the GPU runtime's standard deviation is just 0.18%!
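
For anyone who wants to check the arithmetic, here is a quick sketch that recomputes those ratios from the timings reported above. The variable names are illustrative; the inputs are the reported means and standard deviations, converted to milliseconds.

```python
# Reported timings, all in milliseconds (65.8 µs = 0.0658 ms).
orig_mean_ms, orig_std_ms = 136.0, 1.59
gpu_mean_ms, gpu_std_ms = 36.6, 0.0658

# Fraction of the original runtime that the GPU run saves.
runtime_reduction = 1 - gpu_mean_ms / orig_mean_ms
# How many times faster the GPU run is.
speedup = orig_mean_ms / gpu_mean_ms
# Standard deviation relative to the mean, for each run.
orig_rel_std = orig_std_ms / orig_mean_ms
gpu_rel_std = gpu_std_ms / gpu_mean_ms

print(f"runtime reduction: {runtime_reduction:.1%}")  # 73.1%
print(f"speedup:           {speedup:.2f}x")           # 3.72x
print(f"original std/mean: {orig_rel_std:.2%}")       # 1.17%
print(f"GPU std/mean:      {gpu_rel_std:.2%}")        # 0.18%
```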

Ensuring Quality with Speed

Hold in your excitement for now, though. Your model could have the fastest runtime in the world; if it doesn't maintain its accuracy through the conversion, it could be utterly useless. Let's see how the predictions compare between the original model and both transformed models. For this, we turn to one of my favorite visualization libraries, seaborn.
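
The plotting cell isn't reproduced here, but the quantity being plotted is simple enough to sketch. The assumption is that each model exposes predicted probabilities as an array of the same shape; `prediction_deltas` and `summarize` are hypothetical helpers, and the toy arrays below only stand in for real model outputs.

```python
import numpy as np


def prediction_deltas(reference_proba, converted_proba) -> np.ndarray:
    """Element-wise difference (converted - reference), flattened."""
    reference = np.asarray(reference_proba, dtype=np.float64)
    converted = np.asarray(converted_proba, dtype=np.float64)
    if reference.shape != converted.shape:
        raise ValueError("prediction arrays must have the same shape")
    return (converted - reference).ravel()


def summarize(deltas: np.ndarray) -> dict:
    """Bias (mean), spread (std), and worst-case absolute difference."""
    return {
        "bias": float(np.mean(deltas)),
        "std": float(np.std(deltas)),
        "max_abs": float(np.max(np.abs(deltas))),
    }


# Toy stand-in: the same probabilities after a round trip through float32.
reference = np.array([[0.2, 0.8], [0.6, 0.4]])
converted = reference.astype(np.float32)
print(summarize(prediction_deltas(reference, converted)))
```

The flattened deltas can then be handed to a histogram function, for example seaborn's `histplot`, to get the kind of distribution plot discussed next.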

Interesting...

Not bad at all. The distribution of the deltas for the CPU-based model is rather symmetric around zero with a 3σ (note the axis scale) around 1e-7. The distribution of the deltas for the GPU-based model has a similarly small deviation, but shows a non-zero bias and a skew! This is certainly interesting behavior that piques the curious mind, but it remains a small detail for all but the most precision-sensitive models.

The jury is in: Hummingbird delivers precision alongside the speedup 👍.

Check out the results from some of Microsoft's larger-scale comparisons. 🚀

The Cherry On Top

Oh, and by the way, you also automatically plug into all of the future computational optimizations that come from the thousands of people employed to work on these mega-frameworks. As support for the less popular frameworks dies off (trust me, it eventually happens), you will sit comfortably, knowing that every one of your models runs on a well-supported, tensor-based inference framework.

After all, we're in the business of data science, not runtime optimization. It feels good to leverage the big guys to get the job done in that area, freeing us up to focus on our core competencies.

Conclusion

As with a lot of other recent moves by Microsoft Research, I'm excited about Hummingbird. This is a great step towards consolidation in the rapidly diverging ML space, and some order from chaos is always a good thing. I'm sure the runtime hiccups of its CPU-based inference will be smoothed out over time while ensuring continued advantages on the GPU. As updates roll out, we're just a few clicks away from a GPU-enabled Deepnote notebook ready to test their claims!