Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retiring text_bytes_aware_shuffle to use shuffle directly #316

Merged

Conversation

praateekmahajan
Copy link
Collaborator

@praateekmahajan praateekmahajan commented Oct 21, 2024

Description

Now that cudf supports large strings, we can use shuffle directly.
Resolves #240

Here is the performance improvement when compared existing older implementation (which ultimately called rearrange_column_by_direct)
image

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
Signed-off-by: Praateek <[email protected]>
@praateekmahajan praateekmahajan added the gpuci Run GPU CI/CD on PR label Oct 22, 2024
@sarahyurick
Copy link
Collaborator

Thanks! Was wondering if you have any small-scale examples that could be added to test_fuzzy_dedup.py?

Copy link
Collaborator

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me on the assumption we have verified output_shuffled_docs_path looked the same for both main and the PR .

@praateekmahajan praateekmahajan marked this pull request as draft October 23, 2024 20:07
@praateekmahajan praateekmahajan linked an issue Oct 28, 2024 that may be closed by this pull request
@praateekmahajan praateekmahajan changed the title Retiring text_bytes_aware_shuffle to use shuffle directly [WIP] Retiring text_bytes_aware_shuffle to use shuffle directly Oct 31, 2024
Comment on lines 1158 to 1160
elif os.environ["SHUFFLE_APPROACH"] == "dask_vanilla":
self._logger.info("Using dask's vanilla shuffle")
output_df = subset_merged_df.shuffle(on=partition_on)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add shuffle_approach to FuzzyDuplicatesConfig if there's more than one of these that you think we should keep in the codebase. Up to you.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sarahyurick we won't be keeping all approaches, I just had them for my own benchmarking efforts. I'll update the PR with the results, and based on those we'll only be keeping the approach of dask_vanilla (once @ayushdg is able to confirm at scale too)

@praateekmahajan praateekmahajan changed the title [WIP] Retiring text_bytes_aware_shuffle to use shuffle directly Retiring text_bytes_aware_shuffle to use shuffle directly Nov 19, 2024
@praateekmahajan praateekmahajan marked this pull request as ready for review November 19, 2024 17:07
@praateekmahajan praateekmahajan added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Nov 19, 2024
Copy link
Collaborator

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@praateekmahajan praateekmahajan merged commit 914e61b into NVIDIA:main Nov 20, 2024
5 checks passed
vinay-raman pushed a commit to vinay-raman/NeMo-Curator that referenced this pull request Nov 26, 2024
ruchaa-apte pushed a commit to ruchaa-apte/NeMo-Curator that referenced this pull request Dec 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
gpuci Run GPU CI/CD on PR
Projects
None yet
3 participants