Experimental RAPIDS cuDF Backend for Velox #12412

Open
bdice opened this issue Feb 20, 2025 · 0 comments
Labels
enhancement New feature or request

bdice commented Feb 20, 2025

Description

This issue tracks the development of a RAPIDS cuDF backend for Velox. The project will be implemented in a series of incremental pull requests described below. Our initial goal is to run the TPC-H query plans defined by Velox, thereby covering a broad range of basic data analytics operations end-to-end on GPU.

All of the work described below currently exists in a fork of this repository, which we hope to upstream piece by piece. I plan to update this issue frequently as we make progress on the initial stages of this work.

cuDF team: @karthikeyann @devavret @mhaseeb123 @GregoryKimball

cc: @pedroerp @oerling @Yuhta @assignUser

High level design

  • Build:
    • A new Docker container with CUDA 12.8 is needed for developers (based on the existing Ubuntu 22.04 container). 🐳
    • cuDF is fetched and built if the CMake option VELOX_ENABLE_CUDF is true. The dependency logic will go in CMake/resolve_dependency_modules/cudf.cmake. (cuDF only links to the CUDA runtime, not the CUDA driver library.)
    • I plan to work with @assignUser on CI testing. We will probably want to run GPU builds on CPU nodes, and only run GPU CI tests if the cuDF code has changed.
  • Run:
    • A Velox DriverAdapter is used to replace CPU operators with GPU operators that call cuDF's C++ code. This DriverAdapter can be registered at startup by the application using Velox to enable the cuDF backend.
      • Each operator will have a cuDF equivalent. For example, OrderBy will be replaced by CudfOrderBy.
    • Between CPU and GPU operators, a conversion operator is inserted to handle CPU->GPU and GPU->CPU data movement. This allows cuDF operators to be used alongside existing Velox operators.
      • The conversion currently uses Arrow (Velox to Arrow, then Arrow to cuDF). A direct Velox-to-cuDF interop without Arrow may be built in the future for higher performance.
    • Currently, no custom CUDA kernels are needed for this code. All functionality is implemented in pure C++ calling cuDF, which implements the CUDA kernels.
    • There is a lot more to say about tuning for performance (GPU batch sizes, CUDA streams, number of Velox drivers, ...) but I'm leaving that out of this document for the moment.
  • Tests:
    • We have some initial test coverage in place, based on existing Velox tests. We plan to expand the test coverage as we upstream this work.
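The adapter flow described above can be sketched with a toy mock. To be clear, this is not the real Velox or cuDF API: `Op`, `adaptPipeline`, and the conversion-operator names here are hypothetical stand-ins, used only to illustrate the two steps of the pass (swap CPU operators for cuDF equivalents, then insert conversion operators at each CPU/GPU boundary).

```cpp
// Illustrative mock of the DriverAdapter replacement pass.
// All names here are hypothetical; the real Velox/cuDF interfaces differ.
#include <cassert>
#include <string>
#include <vector>

// A stand-in for one operator in a driver's pipeline.
struct Op {
  std::string name;
  bool onGpu;  // true if this mock operator runs via cuDF
};

// Step 1: replace each CPU operator that has a cuDF equivalent.
// Step 2: insert a conversion operator at every CPU<->GPU boundary
// (in the real design, conversion currently goes through Arrow).
std::vector<Op> adaptPipeline(std::vector<Op> ops) {
  for (auto& op : ops) {
    if (op.name == "OrderBy") {
      op = {"CudfOrderBy", true};  // swap in the cuDF equivalent
    }
    // ... HashJoin, FilterProject, aggregations handled likewise
  }
  std::vector<Op> out;
  for (const auto& op : ops) {
    if (!out.empty() && out.back().onGpu != op.onGpu) {
      // Device boundary between adjacent operators: insert a converter.
      out.push_back(
          {op.onGpu ? "CpuToGpuConvert" : "GpuToCpuConvert", op.onGpu});
    }
    out.push_back(op);
  }
  return out;
}
```

For example, a pipeline of `TableScan -> OrderBy -> Sink` would become `TableScan -> CpuToGpuConvert -> CudfOrderBy -> GpuToCpuConvert -> Sink`, which is the property that lets GPU operators coexist with unmodified CPU operators.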

Next steps

Currently, I expect each of the items below to be a separate PR.

  1. Add CUDA Docker Images: build: Add Docker images for CUDA development #12413
  2. Add cuDF to CMake build logic
  3. Incremental feature development in velox/experimental/cudf/. The first few feature PRs will be:
     • cudfDriverAdapter for replacing operators
     • Data conversion (Velox to cuDF and cuDF to Velox, both via Arrow intermediates) and a CudfVector data structure for passing GPU data between operators
     • Operators, probably each in its own PR: HashJoin, OrderBy, aggregations, FilterProject, ...
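As a rough sketch of step 2, the dependency logic in CMake/resolve_dependency_modules/cudf.cmake might look like the following. The option name VELOX_ENABLE_CUDF and the file location come from this issue; the FetchContent details (repository pin, etc.) are illustrative assumptions, not the actual implementation.

```cmake
# Hypothetical sketch of CMake/resolve_dependency_modules/cudf.cmake.
option(VELOX_ENABLE_CUDF "Build the experimental cuDF (GPU) backend" OFF)

if(VELOX_ENABLE_CUDF)
  include(FetchContent)
  FetchContent_Declare(
    cudf
    GIT_REPOSITORY https://github.com/rapidsai/cudf.git
    GIT_TAG <pinned-release-tag>) # pin to a specific cuDF release in practice
  FetchContent_MakeAvailable(cudf)
  # Note: cuDF links only against the CUDA runtime, not the CUDA driver library.
endif()
```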

I'm going to start opening PRs for the CI / build side soon. Ideas and feedback are welcome! 🚀
