Update Fuzzy dedup params for long strings support. #77

ayushdg · 2024-05-22T20:09:42Z

Fuzzy deduplication is currently accelerated via cuDF, which until release 24.04 required that a string column not exceed an int32 number of characters. Consequently, some defaults and core logic in the deduplication pipeline aim to mitigate errors in cases where we may exceed this limit.

Starting with 24.06, cuDF has experimental support for longer strings (an int64 number of chars), and this PR attempts to change defaults and simplify the logic around handling long strings.
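To make the int32 limit concrete, here is a minimal sketch of the kind of mitigation the description alludes to: grouping documents so that no single group exceeds a byte budget. The helper name `split_by_byte_budget` is hypothetical and is not part of cuDF or NeMo Curator. Measuring the limit in UTF-8 bytes is also an assumption made for this illustration.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def split_by_byte_budget(texts, max_bytes=INT32_MAX):
    """Greedily group texts so that no group exceeds max_bytes of UTF-8 data.

    Hypothetical helper for illustration only.
    """
    groups, current, current_bytes = [], [], 0
    for text in texts:
        size = len(text.encode("utf-8"))
        if size > max_bytes:
            raise ValueError("a single string exceeds the byte budget")
        # Start a new group when adding this text would overflow the budget.
        if current and current_bytes + size > max_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(text)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


# Tiny budget so the grouping is easy to follow by hand.
docs = ["aaaa", "bbbb", "cc", "dddddd"]
parts = split_by_byte_budget(docs, max_bytes=8)
```

With an int64 limit the budget can be raised far beyond `INT32_MAX`, which is why defaults tuned around the old limit can be relaxed.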

VibhuJawa

Mostly looks good to me, just a minor nit around max_text_bytes_per_part.

nemo_curator/modules/fuzzy_dedup.py

nemo_curator/utils/fuzzy_dedup_utils/shuffle_utils.py

sarahyurick

LGTM, added a general question.

sarahyurick · 2024-09-06T23:01:04Z

nemo_curator/utils/distributed_utils.py

+            if input_meta is not None:
+                read_kwargs["prune_columns"] = True


Do you think we should move away from input_meta in favor of a keyword like dtype (as in pandas' and cuDF's read_json) and have users configure prune_columns themselves?

I think this would align with #50 too.

ayushdg · Sep 6, 2024

I'm generally in favor of overhauling the current IO helpers for something better when we tackle #50. I'll share more thoughts there, but moving toward encouraging users to use the read_xyz APIs is easier.
We can then have a common helper that directs to the relevant read_xyz API based on the file type, rather than the other way around, where read_json goes to a common read method that handles different formats.
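The dispatch direction described here can be sketched with the standard library alone. Everything in this block is a hypothetical illustration: `read_jsonl`, `read_csv`, `READERS`, and `read_any` are made-up names and do not reflect NeMo Curator's actual helpers.

```python
import csv
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def read_csv(path):
    """Read a CSV file with a header row into a list of dicts."""
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))


# Hypothetical registry: file extension -> format-specific reader.
READERS = {".jsonl": read_jsonl, ".csv": read_csv}


def read_any(path):
    """Pick the reader from the file extension and delegate to it."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None
    return reader(path)
```

In this layout the format-specific readers stay the primary, user-facing API, and the common helper is only a thin router on top of them.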

Regarding prune_columns specifically: this change is important in newer versions of RAPIDS because many public datasets, like rpv1, do not have consistent metadata across all their files. If we do not prune columns to just ID and text, cuDF will now fail with inconsistent metadata errors.
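A small standard-library sketch of why pruning helps: when files carry different extra fields, their schemas disagree, but restricting every record to the requested columns makes them line up. The sample records and the `read_records` and `schema` helpers are invented for illustration. This is not how cuDF implements prune_columns.

```python
import json

# Two JSONL "files" whose records carry different extra fields,
# mimicking a dataset with inconsistent metadata across files.
file_a = ['{"id": "a-0", "text": "hello", "meta": {"lang": "en"}}']
file_b = ['{"id": "b-0", "text": "world", "source": "web", "score": 0.9}']

# Hypothetical stand-in for input_meta: the only columns we need.
input_meta = {"id": str, "text": str}


def read_records(lines, columns=None):
    """Parse JSON lines; if columns is given, keep only those keys."""
    records = [json.loads(line) for line in lines]
    if columns is None:
        return records
    return [{name: rec.get(name) for name in columns} for rec in records]


def schema(records):
    """Return the set of field names seen across the records."""
    return {key for rec in records for key in rec}


# Without pruning, the per-file schemas differ; with pruning, they match.
unpruned = [schema(read_records(f)) for f in (file_a, file_b)]
pruned = [schema(read_records(f, columns=input_meta)) for f in (file_a, file_b)]
```

Here `unpruned` holds two different field sets, while both entries of `pruned` contain only the id and text fields.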

ryantwolf

LGTM

VibhuJawa

LGTM

* Expose more configurations to test long string support
  Signed-off-by: Ayush Dattagupta <[email protected]>
* Export libcudf env for long string support
  Signed-off-by: Ayush Dattagupta <[email protected]>
* Default to using larger batches
  Signed-off-by: Ayush Dattagupta <[email protected]>
* Remove large strings env variable since it's enabled by default in 24.8 and above
  Signed-off-by: Ayush Dattagupta <[email protected]>
* Remove debug print, filter nulls before bucketing
  Signed-off-by: Ayush Dattagupta <[email protected]>
* Remove hardcoded id field
  Signed-off-by: Ayush Dattagupta <[email protected]>

---------

Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Yang Yu <[email protected]>

ayushdg requested a review from VibhuJawa May 23, 2024 00:01

ayushdg force-pushed the ayushdg/long-string-support branch from 3849f25 to 19777e7 Compare July 10, 2024 15:49

VibhuJawa requested changes Sep 5, 2024

nemo_curator/modules/fuzzy_dedup.py

nemo_curator/utils/fuzzy_dedup_utils/shuffle_utils.py

ayushdg marked this pull request as ready for review September 6, 2024 00:36

ayushdg changed the title ~~[Draft] Update Fuzzy dedup params for long strings support.~~ Update Fuzzy dedup params for long strings support. Sep 6, 2024

ayushdg added 3 commits September 6, 2024 14:33

Expose more configurations to test long string support

53a86ca

Signed-off-by: Ayush Dattagupta <[email protected]>

Export libcudf env for long string support

d29ce8f

Signed-off-by: Ayush Dattagupta <[email protected]>

Default to using larger batches

deca160

Signed-off-by: Ayush Dattagupta <[email protected]>

ayushdg force-pushed the ayushdg/long-string-support branch from a892490 to deca160 Compare September 6, 2024 21:35

ayushdg added 3 commits September 6, 2024 14:37

Remove large strings env variable since it's enabled by default in 24…

5559ce1

….8 and above Signed-off-by: Ayush Dattagupta <[email protected]>

Remove debug print, filter nulls before bucketing

73e12b6

Signed-off-by: Ayush Dattagupta <[email protected]>

Remove hardcoded id field

cbf1d66

Signed-off-by: Ayush Dattagupta <[email protected]>

ayushdg mentioned this pull request Sep 6, 2024

Make max_text_bytes_per_part configurable #233

Closed

ayushdg requested review from VibhuJawa, sarahyurick and ryantwolf September 6, 2024 21:51

sarahyurick approved these changes Sep 6, 2024

ryantwolf approved these changes Sep 6, 2024

VibhuJawa approved these changes Sep 6, 2024

ayushdg merged commit 762a670 into main Sep 6, 2024
3 checks passed

ayushdg deleted the ayushdg/long-string-support branch September 7, 2024 00:29

sarahyurick mentioned this pull request Sep 9, 2024

Better mimic DocumentDataset's read_* functions to Dask's read_* functions #50

Open

VibhuJawa mentioned this pull request Sep 10, 2024

Retire text_bytes_aware_shuffle and directly use shuffle #240

Closed

sarahyurick mentioned this pull request Oct 28, 2024

Deprecate max_text_bytes_per_part #331

Open
