Polars GPU engine slower than CPU when joining #20696
Comments
Thanks for the report @mliu-aqtc! I suspect that what you're observing here is that the actual compute in this case is very small for a GPU (a single join column with only 10000 rows) and you're getting hit with a lot of overhead from 1) the host->device->host data transfers required by the GPU engine and 2) the large number of independent allocations done by the GPU engine for every column (the Arrow data format used by Polars is columnar, and per-column allocation is far more expensive on the GPU than on the CPU). That said, the gap is pretty wide. I or someone on my team will take a look at a profile as soon as we can and see what we find.
I took a quick look here, and indeed it does seem like the current bottleneck is just copying. Here's a profile: Out of a total of 12.639 seconds running this script, the important subsection is the part under
The context manager shows up because of calls that are needed to initialize the CUDA context and to set up a pool of GPU memory. Those tasks will always be necessary when using the GPU engine, but they are one-time costs incurred the first time a Polars GPU query runs, so in a more complex script they wouldn't be nearly as important. The to_polars call indicates that we're spending about a second converting the GPU result back to host data for Polars. If we zoom in on the Join.evaluate, we actually see this:
which in fact tells us that we're spending 8 seconds just converting our data into GPU data. It's not surprising that this takes significantly longer than converting back, because in your script you have a wide table (lots of columns), and the Arrow data format stores each column as a separate buffer, so there is a lot more per-column overhead in the conversion in that direction. That leaves about 1 second in the actual join. We can't see a further breakdown with a simple Python profile here since all the work is happening in C++/CUDA, but if I use a CUDA profiler I see that the join is actually spending the vast majority of its time creating the buffers needed to hold all the output data; the actual join computation is <1% of the total time.

The upshot is that I think this script is a particularly bad case for the GPU, for a number of reasons. Some of these should improve over time as the GPU plugin improves, others are idiosyncratic to this kind of benchmark and will not persist in more realistic workloads, and some are unfortunately intrinsic limitations of this kind of mixed CPU-GPU execution model:
To illustrate 4 above, if I run your script with

I should also mention that host-device data transfers should become steadily cheaper with newer NVIDIA hardware, especially on integrated CPU-GPU SoCs like Grace Hopper where the interconnects are huge, so some of the performance bottlenecks specific to the GPU engine will shrink relative to the CPU engine on those chips.
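As a rough illustration of the conversion asymmetry described above, the host->device and device->host copies can be timed in isolation. This is only a sketch under assumptions: it presumes cudf is installed and that `cudf.DataFrame.from_arrow` and `DataFrame.to_arrow` are available, and the 10,000 x 1,000 shape is invented to mimic a short, wide frame rather than taken from the original script.

```python
import time

import cudf
import numpy as np
import polars as pl

n_rows, n_cols = 10_000, 1_000  # hypothetical "short and wide" shape
wide = pl.DataFrame({f"c{i}": np.random.rand(n_rows) for i in range(n_cols)})
arrow_tbl = wide.to_arrow()  # host-side Arrow table backing the Polars frame

t0 = time.perf_counter()
gdf = cudf.DataFrame.from_arrow(arrow_tbl)  # roughly one device allocation + copy per column
print(f"host->device: {time.perf_counter() - t0:.3f}s")

t0 = time.perf_counter()
_ = gdf.to_arrow()  # the reverse copy, for comparison
print(f"device->host: {time.perf_counter() - t0:.3f}s")
```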
Thanks @vyasr for taking a look!
Yeah, it looks like the transfer to device is the significant bottleneck here, specifically in the cudf integration. We were seeing something similar in our latest profiling:
Gotcha, we do have a fairly wide df due to the number of features we are working with. If this is a blocker, perhaps we need to consider some other format for our data. Although perhaps if the cudf integration is improved, this will at least get better?
Makes sense, I was trying to provide as small a repro example as possible, but I can see how the transfer cost will be amortized if the operations are larger. Overall though, several seconds feels like much longer than expected for device transfer, even for large, wide dataframes. But as mentioned, there could be some optimization to do in the cudf conversion.
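To make the amortization point concrete, here is a sketch (my own, not the benchmark from this thread; the row counts and the `engine="gpu"` selection are assumptions) that runs the same single-key join at growing row counts, warming the GPU engine up once first so the one-time CUDA context and memory-pool setup discussed above is not counted:

```python
import time

import numpy as np
import polars as pl


def build_query(n_rows: int) -> pl.LazyFrame:
    # Single join key plus one payload column; tall rather than wide.
    df = pl.DataFrame({"key": np.arange(n_rows), "val": np.random.rand(n_rows)})
    return df.lazy().join(df.lazy(), on="key")


build_query(1_000).collect(engine="gpu")  # warm-up: pays the CUDA context + pool setup once

for n_rows in (10_000, 1_000_000, 10_000_000):
    q = build_query(n_rows)

    t0 = time.perf_counter()
    q.collect()  # default CPU engine
    cpu_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    q.collect(engine="gpu")
    gpu_s = time.perf_counter() - t0

    print(f"rows={n_rows:>10,}  cpu={cpu_s:.3f}s  gpu={gpu_s:.3f}s")
```

The fixed transfer and allocation overhead stays roughly constant while the join work grows, so the gap between the two engines should narrow as the tables get taller.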
It's likely that performance will improve over time, yes. I am also surprised by just how large this number is. However, it is very unlikely that the GPU engine will ever be performant relative to the CPU for short and wide tables. Any Arrow-based engine will show better perf with tall tables, but the GPU engine gets a double whammy here due to the additional requirement of many independent memory allocations (in general, GPUs provide the biggest advantage over CPUs when your performance is gated on compute, not memory). If you are looking to get the best performance out of any Arrow-based engine (including Polars), I would recommend considering restructuring your data to use tall tables anyway. Improved GPU utilization will naturally follow.
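A sketch of what that restructuring could look like in Polars (made-up feature columns; assumes a Polars version where `melt` has been renamed to `unpivot`):

```python
import numpy as np
import polars as pl

n_rows, n_cols = 10_000, 500  # hypothetical wide feature table
wide = pl.DataFrame(
    {"id": np.arange(n_rows)}
    | {f"feature_{i}": np.random.rand(n_rows) for i in range(n_cols)}
)

# One row per (id, feature) pair: three tall columns instead of 501 short ones.
tall = wide.unpivot(index="id", variable_name="feature", value_name="value")

print(wide.shape)  # (10000, 501)
print(tall.shape)  # (5000000, 3)
```

The tall layout stores the same values in a handful of long columns, so both the Arrow buffer count and the number of per-column GPU allocations drop accordingly.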
Great point. Yes, this is of course a small repro example, and realistic workflows will have much taller data (though the width may not go down). For now we will stick to CPU. Looking forward to improved performance in the future!
np. If you can, please close this issue for now (I don't have the permissions). I can post updates in the future if we start seeing notable benchmark changes 🙂
Checks
Reproducible example
gpu:
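(The original snippet is not shown here; purely as an illustration, a benchmark along the lines described in the comments, a single-key join on a short, wide frame run with and without the GPU engine, might look like the sketch below. The shapes and the engine selection are assumptions, not the actual script's.)

```python
import time

import numpy as np
import polars as pl

n_rows, n_cols = 10_000, 1_000  # assumed shape, not the reporter's real data
left = pl.DataFrame(
    {"key": np.arange(n_rows)}
    | {f"f{i}": np.random.rand(n_rows) for i in range(n_cols)}
)
right = pl.DataFrame({"key": np.arange(n_rows), "extra": np.random.rand(n_rows)})
q = left.lazy().join(right.lazy(), on="key")

t0 = time.perf_counter()
q.collect()  # default (CPU) engine
print(f"cpu: {time.perf_counter() - t0:.3f}s")

t0 = time.perf_counter()
q.collect(engine="gpu")  # GPU engine; requires the Polars GPU extras (cudf-polars)
print(f"gpu: {time.perf_counter() - t0:.3f}s")
```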
Log output
No response
Issue description
Running the above join code with the CPU engine takes around 0.1s, whereas the GPU engine takes around 7s.
Expected behavior
I would expect the GPU engine not to be significantly slower than the CPU engine for a join operation.
Installed versions