Polars GPU engine slower than CPU when joining #20696

Closed
mliu-aqtc opened this issue Jan 13, 2025 · 6 comments
Labels
A-gpu (Area: gpu engine) · needs triage (Awaiting prioritization by a maintainer) · performance (Performance issues or improvements) · python (Related to Python Polars)

Comments


mliu-aqtc commented Jan 13, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import numpy as np
import time

def make_df():
    # Short and wide: 10,000 rows by 5,001 columns (an "index" key plus 5,000 floats).
    num_rows = 10000
    num_columns = 5000
    index_col = np.arange(num_rows)
    return pl.DataFrame(
        {
            "index": index_col,
            **{
                f"float_col_{i}": np.random.random(size=num_rows)
                for i in range(num_columns)
            },
        }
    )


df1 = make_df()
df2 = make_df()

engine = pl.GPUEngine(raise_on_fail=True)  # raise instead of silently falling back to CPU
# engine = "cpu"  # uncomment to time the default CPU engine instead

start = time.monotonic()
result = df1.lazy().join(
    df2.lazy(),
    on="index",
    how="inner"
).collect(engine=engine)

print(time.monotonic() - start)
print(result.shape)

GPU (nvidia-smi output):

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:4A:00.0 Off |                    0 |
| N/A   26C    P8             32W /  350W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L40S                    Off |   00000000:61:00.0 Off |                    0 |
| N/A   26C    P8             32W /  350W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L40S                    Off |   00000000:CA:00.0 Off |                    0 |
| N/A   26C    P8             31W /  350W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L40S                    Off |   00000000:E1:00.0 Off |                    0 |
| N/A   27C    P8             31W /  350W |       1MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

Log output

No response

Issue description

Running the above join with the CPU engine takes around 0.1 s, whereas the GPU engine takes around 7 s.

Expected behavior

I would expect the GPU engine not to be significantly slower than the CPU engine for a join operation.

Installed versions

--------Version info---------
Polars:              1.12.0
Index type:          UInt32
Platform:            Linux-5.4.0-200-generic-x86_64-with-glibc2.31
Python:              3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
LTS CPU:             False
mliu-aqtc added the bug, needs triage, and python labels on Jan 13, 2025
ritchie46 added the A-gpu and performance labels and removed the bug label on Jan 14, 2025

vyasr commented Jan 16, 2025

Thanks for the report @mliu-aqtc! I suspect that what you're observing here is that the actual compute in this case is very small for a GPU (a single join column with only 10000 rows) and you're getting hit with a lot of overhead from 1) the host->device->host data transfers required by the GPU engine and 2) the large number of independent allocations the GPU engine makes for every column (the Arrow data format used by Polars is columnar, and per-column allocation is far more expensive on the GPU than on the CPU). That being said, the gap is pretty wide. I or someone on my team will take a look at a profile as soon as we can and see what we find.
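
To make those two overheads concrete, here is a quick sketch (it reuses df1 from your reproducible example and assumes pyarrow is installed, which to_arrow() needs) that prints roughly how much data each frame carries and how many separate Arrow buffers back it:

print(f"approx. data to transfer per frame: {df1.estimated_size('mb'):.0f} MB")

table = df1.to_arrow()
n_buffers = sum(
    buf is not None
    for col in table.columns
    for chunk in col.chunks
    for buf in chunk.buffers()
)
print(f"{table.num_columns} columns -> {n_buffers} Arrow buffers to allocate and copy")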


vyasr commented Jan 17, 2025

I took a quick look here, and indeed it does seem like the current bottleneck is just copying. Here's a profile:

[profile screenshot]

Out of a total of 12.639 seconds running this script, the important subsection is the part under _execute_from_rust, which took 11.048 seconds. Out of that we see:

  • 9.087 Join.evaluate
  • 1.096 DataFrame.to_polars
  • 0.863 _GeneratorContextManager.__enter__

The context manager shows up because of calls that are needed to initialize the CUDA context and to set up a pool of GPU memory. Those are tasks that will always be necessary when using the GPU engine, but they are one-time costs occurring the first time that a polars GPU query runs, so in a more complex script they wouldn't be nearly as important. The to_polars call indicates that we're spending about a second converting the GPU result back to host data for Polars.
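
As an aside, a rough way to factor that one-time setup out of a measurement (just a sketch, reusing df1/df2 from the repro) is to run an untimed warm-up collect first and only time a second one:

import time

import polars as pl

engine = pl.GPUEngine(raise_on_fail=True)
lf = df1.lazy().join(df2.lazy(), on="index", how="inner")

lf.collect(engine=engine)  # warm-up: pays the CUDA context / memory pool setup

start = time.monotonic()
lf.collect(engine=engine)  # steady state: transfers + join only
print("warm GPU collect:", time.monotonic() - start)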

If we zoom in on the Join.evaluate, we actually see this:

  • 8.037 DataFrame.from_polars
  • 1.040 Join.do_evaluate

which tells us that we're spending 8 seconds just converting our data into GPU data. It's not surprising that this takes significantly longer than converting back, because your script has a wide table (lots of columns) and the Arrow data format stores each column as a separate buffer, so there is a lot more per-column overhead in the conversion in that direction. That leaves about 1 second in the actual join. We can't see a further breakdown with a simple Python profile here since all the work is happening in C++/CUDA, but if I use a CUDA profiler I see that the join spends the vast majority of its time creating the buffers needed to hold all the output data; the actual join computation is <1% of the total time.
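
For reference, a call-level breakdown like the one above can be reproduced with a plain Python profiler; this is only a sketch using the standard-library cProfile (the screenshot above may have come from a different tool):

import cProfile
import pstats

import polars as pl

engine = pl.GPUEngine(raise_on_fail=True)
lf = df1.lazy().join(df2.lazy(), on="index", how="inner")

with cProfile.Profile() as prof:
    lf.collect(engine=engine)

pstats.Stats(prof).sort_stats("cumulative").print_stats(20)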

The upshot here is that I think that this script is a particularly bad case for the GPU for a number of reasons. Some of these should improve over time as the GPU plugin improves, others are idiosyncratic to this kind of benchmark and will not persist in more realistic workloads, and some are unfortunately intrinsic limitations of this kind of mixed CPU-GPU execution model:

  1. Your query is essentially trivial in that it only performs a single operation (the join) as part of the collect. Therefore the overhead of copying is very large relative to the amount of work being done. You'll get much better mileage out of the GPU plugin if you have a complex sequence of operations in there that benefit from significant GPU acceleration (multiple joins, more complex joins, aggregations, etc).
  2. You only execute a single collect, so the time spent on start-up work (the context manager) looks disproportionately large compared to how it would in a normal script or in, e.g., interactive Jupyter usage. It's also worth noting that at the moment each collect call triggers a host->device data transfer so that the GPU can compute, followed by a transfer back to host. We could substantially improve this situation if the GPU engine had a way to persist data on the GPU between collect calls and only copy it back when an operation cannot be performed there. That isn't currently possible, but it is a potential future optimization.
  3. There is probably nonzero overhead in the polars->cudf->polars conversion layer that doesn't show up super cleanly in my above benchmark. That should reduce as we improve that intermediate layer over time.
  4. You have a short (few rows) and wide (many columns) table. That is unfortunately probably always going to be the worst-case scenario for the GPU engine, due to 1) the Arrow layout leading to many copies for wide tables, as mentioned above, and 2) the fact that GPUs are most effective when given large amounts of work; since most compute is done on a per-column basis, the small number of rows is not going to feed the GPU very effectively. You will in general see much more benefit from a GPU when you are looking at tables with many rows and fewer columns.

To illustrate point 4 above, if I run your script with num_rows=1_000_000 and num_columns=5, I see that CPU Polars takes 0.05 s while the Join.evaluate node in GPU Polars takes 0.025 s, about half. Of course, the overall GPU Polars runtime is still significantly higher (1.12 s) due to the fixed costs mentioned above, but again we'd expect those to matter less for more complex queries and longer scripts.
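
If it helps, here is a rough sketch of that comparison (a parameterized version of make_df from the repro, timing both engines on the wide and the tall shape; exact numbers will of course vary by machine):

import time

import numpy as np
import polars as pl


def make_df(num_rows: int, num_columns: int) -> pl.DataFrame:
    return pl.DataFrame(
        {
            "index": np.arange(num_rows),
            **{
                f"float_col_{i}": np.random.random(size=num_rows)
                for i in range(num_columns)
            },
        }
    )


for num_rows, num_columns in [(10_000, 5_000), (1_000_000, 5)]:
    df1, df2 = make_df(num_rows, num_columns), make_df(num_rows, num_columns)
    lf = df1.lazy().join(df2.lazy(), on="index", how="inner")
    for name, engine in [("cpu", "cpu"), ("gpu", pl.GPUEngine(raise_on_fail=True))]:
        start = time.monotonic()
        lf.collect(engine=engine)
        print(f"{num_rows}x{num_columns} {name}: {time.monotonic() - start:.3f}s")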

I should also mention that host-device data transfers should become increasingly cheap on newer NVIDIA hardware, especially integrated CPU-GPU SoCs like Grace Hopper, where the interconnect bandwidth is far higher, so some of the performance bottlenecks specific to the GPU engine should shrink relative to the CPU engine on those chips.

mliu-aqtc (Author) commented

Thanks @vyasr for taking a look!

There is probably nonzero overhead in the polars->cudf->polars conversion layer

Yeah it looks like the transfer to device is the significant bottleneck here, specifically the cudf integration. We were seeing something similar in our latest profiling:

  %Own   %Total  OwnTime  TotalTime  Function (filename)                                                                          
  0.00%   0.00%   11.19s    12.54s   from_polars (cudf_polars/containers/dataframe.py)

You have a short (few rows) and wide (many columns) table. That is unfortunately probably always going to be the worst case scenario for the GPU engine

Gotcha, we do have a fairly wide DataFrame due to the number of features we are working with. If this is a blocker, perhaps we need to consider some other format for our data. Although if the cuDF integration is improved, perhaps this metric will at least get better?

Your query is essentially trivial in that it only performs a single operation (the join) as part of the collect

Makes sense; I was trying to provide as small a repro example as possible, but I can see how the transfer cost gets amortized when the operations are larger. Overall, though, several seconds feels much longer than expected for a device transfer, even for large, wide DataFrames. But as mentioned, there may be some optimization to do in the cuDF conversion.


vyasr commented Jan 21, 2025

It's likely that performance will improve over time, yes. I am also surprised by just how large this number is. However, it is very unlikely that the GPU engine will ever be performant relative to the CPU for short and wide tables. Any Arrow-based engine will show better performance with tall tables, but the GPU engine gets a double-whammy here due to the additional requirement of many independent memory allocations (in general, GPUs provide the biggest advantage over CPUs when your performance is gated on compute, not memory). In general, if you are looking to get the best performance out of any Arrow-based engine (including Polars), I would recommend restructuring your data into tall tables anyway. Improved GPU utilization will naturally follow.
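
For what it's worth, here is a sketch of that restructuring applied to the repro's frame (it assumes a recent Polars where DataFrame.unpivot is available; older versions call it melt):

# Wide (index + 5,000 float columns) -> tall (index, feature, value).
tall = df1.unpivot(
    index="index",            # keep the join key as an identifier column
    variable_name="feature",  # records which float_col_* each value came from
    value_name="value",
)
print(tall.shape)  # (num_rows * num_columns, 3)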

mliu-aqtc (Author) commented

Great point. Yes, this is of course a small repro example and realistic workflows will have much taller data (though the width may not go down). For now we will stick to CPU. Looking forward to improved performance in the future!


vyasr commented Jan 21, 2025

np. If you can, please close this issue for now (I don't have the permissions). I can post updates in the future if we start seeing notable benchmark changes 🙂
