Get a 300x speedup of your Machine Learning pipeline with RapidsAI

Have you ever run Nearest Neighbors on a GPU? No? Then what are you waiting for?

https://unsplash.com/photos/9HI8UJMSdZA

When using simple Machine Learning algorithms, like Nearest Neighbors, on huge datasets, it often becomes a pain to find good model hyperparameters or even to build a strong cross-validation framework, because the model takes ages to finish training even with a simple train-test split! One way to overcome this is to distribute the work across CPUs using Dask or PySpark. But today I want to show you another way out: fitting the model using your GPU's power. Previously there was no good way of doing this with models from the Sklearn library, but now you can fit a vast majority of sklearn models on the GPU using the Rapids AI cuML library! In this article I want to show you a quick comparison of Rapids cuML vs Sklearn on the Nearest Neighbors algorithm. If you are interested in the installation process, it is described pretty well on the Rapids GitHub page. Let's get started!

The Code

I will use the Kaggle platform, which provides 30 hours per week of free Tesla K80 GPU usage (TPUs are also available).

As a first step, let's import the Nearest Neighbors algorithm from both libraries. The API calls look identical.
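The original embedded code was not preserved, so here is a minimal sketch of the imports. It assumes cuML is installed as described on the Rapids GitHub page; the `try/except` guard (my addition, not part of the article) lets the same script run on a machine without a GPU.

```python
# sklearn's CPU estimator is always available
from sklearn.neighbors import NearestNeighbors as skNearestNeighbors

try:
    # cuML's GPU estimator; requires a RAPIDS install with a CUDA GPU
    from cuml.neighbors import NearestNeighbors as cuNearestNeighbors
    HAS_CUML = True
except ImportError:
    HAS_CUML = False

print("cuML available:", HAS_CUML)
```

Both classes expose the same `fit` / `kneighbors` methods, which is what makes swapping them in and out so painless.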

Now, I will create a dummy dataset with the sklearn.datasets.make_blobs method. I will be creating 5 datasets of different sizes, from small to big. On each dataset I will fit a Nearest Neighbors model with 10 neighbors and then compute the nearest neighbors for each point in the test set. Let's fit the Rapids GPU Nearest Neighbors first.
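A sketch of this benchmark loop follows. The exact dataset sizes and `make_blobs` parameters are my assumptions (the article does not list them), and the fallback to sklearn when cuML is absent is mine as well, so the snippet stays runnable on CPU-only machines.

```python
import time

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

try:
    from cuml.neighbors import NearestNeighbors  # GPU estimator
except ImportError:
    from sklearn.neighbors import NearestNeighbors  # CPU stand-in

# Illustrative sizes only; scale these up to reproduce a real benchmark.
sizes = [1_000, 10_000, 100_000]
timings = {}

for n in sizes:
    X, _ = make_blobs(n_samples=n, n_features=20, centers=5, random_state=42)
    X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

    nn = NearestNeighbors(n_neighbors=10)
    start = time.perf_counter()
    nn.fit(X_train)
    distances, indices = nn.kneighbors(X_test)
    timings[n] = time.perf_counter() - start
    print(f"n={n:>9,}: {timings[n]:.3f}s, neighbors shape {indices.shape}")
```

Because cuML mirrors the sklearn API, the timing loop itself does not change at all between the GPU and CPU runs.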

As you can see, it performed really fast. Now it is time to fit Sklearn.
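For reference, the CPU-side run looks like this. Again, the dataset size and blob parameters are illustrative placeholders rather than the article's actual values.

```python
import time

from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# One mid-sized dataset, timed on the CPU with sklearn.
X, _ = make_blobs(n_samples=50_000, n_features=20, centers=5, random_state=42)

nn = NearestNeighbors(n_neighbors=10)
start = time.perf_counter()
nn.fit(X)
distances, indices = nn.kneighbors(X)
elapsed = time.perf_counter() - start
print(f"sklearn on 50,000 points: {elapsed:.2f}s")
```

The `kneighbors` query, not the `fit` call, dominates the runtime here, which is exactly the part the GPU accelerates.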

Wow! That took really long. As you can see, on the small datasets sklearn outperforms Rapids, but on the biggest dataset we get a 300x speedup.

Let's also build a small comparison graph using Plotly (I will log-transform the timings to make the results easier to see).
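A sketch of such a chart is below. The timing numbers are hypothetical placeholders, not the article's measurements, and the Plotly part is guarded with `try/except` so the data preparation still runs where Plotly is not installed.

```python
import math

# Placeholder timings in seconds -- NOT the article's measured results.
sizes = [1_000, 10_000, 100_000, 1_000_000, 10_000_000]
sklearn_sec = [0.01, 0.1, 2.0, 60.0, 1800.0]
cuml_sec = [0.05, 0.08, 0.2, 1.0, 6.0]

# Log-transform so both curves remain visible on one axis.
log_sk = [math.log10(t) for t in sklearn_sec]
log_cu = [math.log10(t) for t in cuml_sec]

try:
    import plotly.graph_objects as go

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=sizes, y=log_sk, name="sklearn"))
    fig.add_trace(go.Scatter(x=sizes, y=log_cu, name="cuML"))
    fig.update_layout(xaxis_title="dataset size",
                      yaxis_title="log10(seconds)")
    # fig.show()  # uncomment in a notebook to render the chart
except ImportError:
    pass
```

On a log scale the crossover point, where the GPU starts to win, is easy to spot.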

Wow, I sometimes still can't believe such a huge speedup is possible; the cuML library is definitely worth trying! It would also be interesting to compare Rapids with something like Dask, but that is an idea for another post.

Thanks for reading!