Support Pandas version 2 #1306

Closed
siddharthab opened this issue Jan 19, 2024 · 10 comments

Comments

@siddharthab

Pandas 2.0.0 was released in April 2023. We should spend some effort to make this project compatible with the 2.y versions. Pandas 3 also has a dev release out, so maybe we can try for that as well now.

@pentschev
Member

Technically speaking, Dask-CUDA has no compatibility issues with pandas 2, but for that to be useful you'll also need cuDF to support it. There's ongoing work on that in rapidsai/cudf#13535, which #1213 is waiting on.

We are also happy to accept PRs in both Dask-CUDA and cuDF to expand support for libraries that our users need.

@siddharthab
Author

Thank you. I suppose, for this repo, just removing the version constraint in pyproject.toml will help a lot. Currently, that constraint stops us from using Pandas 2 in our dask job at all, even if we don't use cuDF.
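
For anyone in a similar environment, one way to see which installed package is constraining pandas is to read the declared requirements from the package metadata. The following is only a rough sketch: `requirements_on` and `installed_pins` are hypothetical helper names, the name matching is a simplified regex rather than a full requirement parser, and the pin strings in the usage note are made-up examples, not the actual contents of this repo's pyproject.toml.

```python
import re
from importlib import metadata


def _normalize(name):
    """Normalize a project name (lowercase, runs of -, _, . become -)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirements_on(requirements, target):
    """Return the requirement strings whose project name is `target`.

    `requirements` is a list of requirement strings such as
    "pandas>=1.3,<2.0" (or None, which is treated as an empty list).
    """
    wanted = _normalize(target)
    matches = []
    for req in requirements or []:
        # Simplified parsing: take the leading project name only.
        found = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", req)
        if found and _normalize(found.group(1)) == wanted:
            matches.append(req)
    return matches


def installed_pins(dist_name, target="pandas"):
    """List what the installed distribution `dist_name` declares for `target`.

    Returns an empty list if `dist_name` is not installed.
    """
    try:
        declared = metadata.requires(dist_name)
    except metadata.PackageNotFoundError:
        return []
    return requirements_on(declared, target)
```

Calling `installed_pins("dask-cuda")` in the affected environment would then show whatever pandas requirement that distribution declares. As a made-up illustration, `requirements_on(["pandas>=1.3,<2.0", "numpy>=1.21"], "pandas")` returns `["pandas>=1.3,<2.0"]`.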

@pentschev
Member

> Thank you. I suppose, for this repo, just removing the version constraint in pyproject.toml will help a lot. Currently, that constraint stops us from using Pandas 2 in our dask job at all, even if we don't use cuDF.

I am not necessarily opposed, but I do have mixed feelings about this. On the one hand I understand your ask, but ultimately Dask-CUDA is primarily meant to be used with GPU libraries, which in this case in particular implies cuDF. Removing the pin would loosely communicate "we support pandas 2 already", which is not true because we can't test it yet.

@galipremsagar @shwina @rjzamora @quasiben do you have thoughts on this? Perhaps the current cuDF pin to pandas<2 would suffice and we could unblock users who are in the situation described above?

In any case, the most recent plan is to have pandas 2 support in 24.04, which is due early April.

@siddharthab
Author

Thank you for your reply and thank you for considering the request.

> which in this case in particular implies cuDF

I am not sure that is how everyone currently characterizes Dask-CUDA, especially considering that cuDF is not even a listed dependency of Dask-CUDA. For example, we use Dask-CUDA only for LocalCUDACluster in our ML batch prediction workflows (we don't have cuDF installed in our environment), without using distributed data frames. Our workflows use pandas to do some lightweight preprocessing before distributing the workload, but the version limit in this repo limits the pandas version in our environment. I think any version limits in this repo should be about the usage of pandas in this repo.
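
To make the workflow concrete, here is a minimal sketch of that kind of usage, not our actual code: pandas does a bit of cleanup on the driver, and only plain Python batches are sent to the GPU workers, so cuDF is never imported. The helper names, the sample data, and the `predict_batch` stand-in are all hypothetical; a real job would run model inference on the worker's GPU inside `predict_batch`.

```python
def split_into_batches(rows, batch_size):
    """Yield consecutive slices of `rows`, each with at most `batch_size` items."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def predict_batch(batch):
    """Hypothetical stand-in for model inference on one batch."""
    return [len(text) for text in batch]


def run_batch_prediction(batch_size=2):
    """Preprocess with pandas, then fan the batches out to one worker per GPU.

    Imports are local so the helpers above can be used without a GPU.
    Requires pandas, distributed, and dask-cuda on a machine with NVIDIA GPUs.
    """
    import pandas as pd
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    # Lightweight preprocessing on the driver with plain pandas (made-up data).
    df = pd.DataFrame({"text": ["  first ", "second", None, " third"]})
    df = df.dropna(subset=["text"]).assign(text=lambda d: d["text"].str.strip())
    batches = list(split_into_batches(df["text"].tolist(), batch_size))

    # No distributed data frames involved: just one task per batch.
    with LocalCUDACluster() as cluster, Client(cluster) as client:
        futures = client.map(predict_batch, batches)
        return client.gather(futures)
```

Nothing here touches pandas from within Dask-CUDA itself, which is why the pandas version limit feels unrelated to this usage. When running it as a script, `run_batch_prediction()` should be called under an `if __name__ == "__main__":` guard, since the cluster starts worker processes.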

@vyasr
Contributor

vyasr commented Jan 19, 2024

I think it's fine to rely on cudf's upper bound for pandas. dask-cuda users who aren't using cudf should be free to use newer versions of pandas if it works for them.

@vyasr
Contributor

vyasr commented Jan 26, 2024

And this is actually now blocking cudf's ability to test our pandas 2 support with dask, so I'm going to go ahead and open a PR to lift this constraint. Let's hope using the latest pandas doesn't break any of dask-cuda's own tests!

@rjzamora
Member

> Let's hope using the latest pandas doesn't break any of dask-cuda's own tests!

No worries - I'll be happy to investigate anything that breaks :)

rapids-bot bot pushed a commit that referenced this issue Jan 26, 2024
dask-cuda uses pandas for some tests, but the main reason for the pinning is that it is inherited from RAPIDS libraries (mainly cudf) that do not yet support pandas 2.0 and are the primary use case for dask-cuda. However, there is no reason dask-cuda cannot be used in other contexts, so relaxing this constraint makes sense.

Resolves #1306

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Lawrence Mitchell (https://github.com/wence-)
  - Ray Douglass (https://github.com/raydouglass)

URL: #1308

@pentschev
Member

Thanks @vyasr for taking care of this during my absence.

@pentschev
Member

This was resolved by #1308, closing.

@siddharthab
Author

Thank you everyone for such a prompt resolution.

younseojava pushed a commit to ROCm/dask-cuda-rocm that referenced this issue Apr 16, 2024