Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Fuzzy dedup params for long strings support. #77

Merged
merged 6 commits into from
Sep 6, 2024

Conversation

ayushdg
Copy link
Collaborator

@ayushdg ayushdg commented May 22, 2024

Fuzzy deduplication is currently accelerated via cuDF which until release 24.04 had a limit that a string column could not exceed int32 number of characters. Consequently some defaults and core logic in the deduplication pipeline aims to mitigate errors for cases where we may exceed this value.

Starting 24.06, cuDF has experimental support for longer strings (int64 number of chars), and this PR attempts to change defaults and simplify logic around handling long strings.

@ayushdg ayushdg requested a review from VibhuJawa May 23, 2024 00:01
@ayushdg ayushdg force-pushed the ayushdg/long-string-support branch from 3849f25 to 19777e7 Compare July 10, 2024 15:49
Copy link
Collaborator

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good to me, just a minor nit around max_text_bytes_per_part

nemo_curator/modules/fuzzy_dedup.py Show resolved Hide resolved
@ayushdg ayushdg marked this pull request as ready for review September 6, 2024 00:36
@ayushdg ayushdg changed the title [Draft] Update Fuzzy dedup params for long strings support. Update Fuzzy dedup params for long strings support. Sep 6, 2024
@ayushdg ayushdg force-pushed the ayushdg/long-string-support branch from a892490 to deca160 Compare September 6, 2024 21:35
Copy link
Collaborator

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, added a general question.

Comment on lines +240 to +241
if input_meta is not None:
read_kwargs["prune_columns"] = True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should move away from input_meta in favor of a keyword like dtype (like Pandas' and cuDF's read_json) and having the user configure prune_columns themselves?

I think this would align with #50 too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally in favor of overhauling the IO helpers in the current setup for something better. When we tackle #50. I'll share more thoughts there, but moving to encouraging users using the read_xyz api's is easier.
We can then have a common helper that based on the filetype directs to the relevant read_xyz api rather than the other way around where read_json goes to a common read method that handles different formats.

Regarding: prune_columns specifically: This change is important in newer versions of rapids because many public datasets like rpv1 do not have consistent metadata across all their files. If we do not prune columns to just ID & Text, cuDF will now fail with inconsistent metadata errors.

Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ayushdg ayushdg merged commit 762a670 into main Sep 6, 2024
3 checks passed
@ayushdg ayushdg deleted the ayushdg/long-string-support branch September 7, 2024 00:29
yyu22 pushed a commit to yyu22/NeMo-Curator that referenced this pull request Oct 9, 2024
* Expose more configurations to test long string support

Signed-off-by: Ayush Dattagupta <[email protected]>

* Export libcudf env for long string support

Signed-off-by: Ayush Dattagupta <[email protected]>

* Default to using larger batches

Signed-off-by: Ayush Dattagupta <[email protected]>

* Remove large strings env variable since it's enabled by default in 24.8 and above

Signed-off-by: Ayush Dattagupta <[email protected]>

* Remove debug print, filter nulls before bucketing

Signed-off-by: Ayush Dattagupta <[email protected]>

* Remove hardcoded id field

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Yang Yu <[email protected]>
yyu22 pushed a commit to yyu22/NeMo-Curator that referenced this pull request Oct 10, 2024
* Expose more configurations to test long string support

Signed-off-by: Ayush Dattagupta <[email protected]>

* Export libcudf env for long string support

Signed-off-by: Ayush Dattagupta <[email protected]>

* Default to using larger batches

Signed-off-by: Ayush Dattagupta <[email protected]>

* Remove large strings env variable since it's enabled by default in 24.8 and above

Signed-off-by: Ayush Dattagupta <[email protected]>

* Remove debug print, filter nulls before bucketing

Signed-off-by: Ayush Dattagupta <[email protected]>

* Remove hardcoded id field

Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants