10 Minutes to cuDF and Dask-cuDF
Centered around Apache Arrow DataFrames on the GPU, RAPIDS is designed to enable end-to-end data science and analytics on GPUs. Together, open source libraries like RAPIDS cuDF and Dask let users process tabular data on GPUs at scale with a familiar, pandas-like API.
With Dask, anything you can do on a single GPU with cuDF can be scaled out and run in parallel across multiple GPUs. Fundamentally, each cudf.DataFrame object is a single partition of a distributed GPU DataFrame managed by Dask. Distributing DataFrames and computation with Dask lets you analyze datasets far larger than a single GPU’s memory without running into out-of-memory errors. The RAPIDS team is working with Dask maintainers and developers to fully support the GPU DataFrame in Dask.
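To make the partition idea concrete, here is a minimal sketch that splits a small cuDF DataFrame into Dask partitions and checks what one partition is. It assumes a recent RAPIDS install with an NVIDIA GPU; the column names, the helper `inspect_partitions`, and the `HAVE_RAPIDS` guard are ours for illustration, and the guard only keeps the script from failing on machines without RAPIDS.

```python
import importlib.util

# cudf and dask_cudf need a RAPIDS install and an NVIDIA GPU.
# This guard (an illustrative addition) skips the demo when they are missing.
HAVE_RAPIDS = all(
    importlib.util.find_spec(name) is not None for name in ("cudf", "dask_cudf")
)


def inspect_partitions(npartitions=4):
    """Build a distributed GPU DataFrame and look at one of its partitions."""
    import cudf
    import dask_cudf

    # A small single-GPU DataFrame with made-up columns.
    df = cudf.DataFrame({"a": list(range(20)), "b": list(range(20, 40))})

    # Hand it to Dask, asking for several partitions.
    ddf = dask_cudf.from_cudf(df, npartitions=npartitions)

    # Materialize the first partition; it should be a plain cudf.DataFrame.
    first = ddf.partitions[0].compute()
    return ddf.npartitions, isinstance(first, cudf.DataFrame), len(first)


if HAVE_RAPIDS:
    n, is_cudf, rows = inspect_partitions()
    print(f"{n} partitions; first is a cudf.DataFrame: {is_cudf}; rows in it: {rows}")
else:
    print("cudf/dask_cudf not found; skipping the GPU example")
```

On a real cluster the partitions would live on different GPUs, but the relationship is the same: Dask coordinates, and each piece is an ordinary cuDF DataFrame.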
Today, we’re excited to share “10 Minutes to cuDF and Dask-cuDF”, an update to our original introduction to GPU DataFrame analytics, “10 Minutes to cuDF”. In this new documentation, we start with single GPU cuDF examples and then show how to do the same operations on multiple GPUs with minimal code changes via Dask-cuDF. Eventually, you’ll be able to use the same standard Dask library you know and love with cuDF without importing dask_cudf.
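As a taste of what "minimal code changes" can look like, the sketch below runs the same groupby on one GPU with cuDF and then through Dask-cuDF. It is an illustration rather than an excerpt from the documentation: it assumes a recent RAPIDS install with an NVIDIA GPU, and the data, the function names, and the `HAVE_RAPIDS` guard are hypothetical.

```python
import importlib.util

# Illustrative guard: run the demo only when RAPIDS is installed.
HAVE_RAPIDS = all(
    importlib.util.find_spec(name) is not None for name in ("cudf", "dask_cudf")
)

# Made-up sample data for the comparison.
data = {"key": [0, 1, 0, 1, 0, 1], "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}


def single_gpu_mean(data):
    """Group-by mean on a single GPU with cuDF."""
    import cudf

    df = cudf.DataFrame(data)
    result = df.groupby("key")["value"].mean()
    return result.sort_index().to_pandas().to_dict()


def multi_gpu_mean(data, npartitions=2):
    """The same group-by mean, expressed through Dask-cuDF."""
    import cudf
    import dask_cudf

    df = cudf.DataFrame(data)
    # The two changes: wrap the DataFrame with from_cudf and call compute().
    ddf = dask_cudf.from_cudf(df, npartitions=npartitions)
    result = ddf.groupby("key")["value"].mean().compute()
    return result.sort_index().to_pandas().to_dict()


if HAVE_RAPIDS:
    print("cuDF:     ", single_gpu_mean(data))
    print("Dask-cuDF:", multi_gpu_mean(data))
else:
    print("cudf/dask_cudf not found; skipping the GPU example")
```

The groupby line is identical in both functions. Dask builds a lazy task graph, so nothing runs until `compute()` is called.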
With cuDF and Dask, whether you’re using a single NVIDIA T4 GPU or all eight NVIDIA V100 GPUs in a DGX-1, your RAPIDS workflow will run smoothly, with the workload distributed intelligently across the available resources. Interested in giving it a try? You can quickly get started with RAPIDS on Google Colab or try our RAPIDS early access Dask implementation on Google Cloud Dataproc.
Check out cuDF on GitHub and let us know what you think! You can download pre-built Docker containers for our RAPIDS 0.7 release from NGC or Docker Hub to get started, or install it yourself via Conda. Don’t want to wait for the next release to use upcoming features? You can download our nightly containers or install via Conda to stay at the tip of our development branch.