
Removal logic for Exact / Fuzzy Dedup #499

Conversation

praateekmahajan
Collaborator

@praateekmahajan praateekmahajan commented Jan 28, 2025

Description

Removal Description

  1. We implement an approximate left-anti join via a broadcast merge. This allows us to scale even when the right side is larger than the memory of a single worker.
  2. We observe that performance varies with the number of partitions (i.e., with partition sizes).
    • The fastest case is when the right side is exactly one partition: we broadcast all of right to every worker and perform the merge locally, resulting in the least data transfer.
  3. One downside of this approach: when the right side is larger than host memory per node, the merge operation needs to spill but cannot, so Dask pauses workers once they reach 80% memory capacity and the job gets stuck. In that case, the user has to implement different merge logic.
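The anti-join described above can be sketched in plain pandas (a minimal illustration only; the actual implementation runs on Dask/cuDF, and the `id` column name here is an assumption):

```python
import pandas as pd

def left_anti_join(left: pd.DataFrame, right: pd.DataFrame, on: str) -> pd.DataFrame:
    """Keep rows of `left` whose key does not appear in `right` (left-anti join)."""
    keys = right[[on]].drop_duplicates()
    merged = left.merge(keys, on=on, how="left", indicator=True)
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Toy example: drop documents whose ids were flagged as duplicates.
docs = pd.DataFrame({"id": [1, 2, 3, 4], "text": ["a", "b", "c", "d"]})
dupes = pd.DataFrame({"id": [2, 4]})
kept = left_anti_join(docs, dupes, on="id")
```

In the distributed setting, broadcasting a single-partition right side to every worker turns this into purely local merges, which is why that case transfers the least data.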

Class Description

  1. Implements an abstract class called Deduplicator (name suggestions welcome)
    1. Lives in nemo_curator._deduplicator.py
    2. Implements identify / remove / __call__
    3. __call__ now accepts a boolean called perform_removal (alternative APIs also accepted)
  2. FuzzyDuplicates / ExactDuplicates extend this class; their former __call__ logic is renamed to identify, since __call__ is now implemented in the base class
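A minimal sketch of the shape this class description implies (class and method names come from the PR; the id-based removal logic, signatures, and the toy `ExactDuplicates` are my assumptions, not the actual implementation):

```python
from abc import ABC, abstractmethod
import pandas as pd

class Deduplicator(ABC):
    """Subclasses implement `identify`; `remove` and `__call__` live in the base class."""

    def __init__(self, id_field: str = "id"):
        self.id_field = id_field

    @abstractmethod
    def identify(self, dataset: pd.DataFrame) -> pd.DataFrame:
        """Return the rows of `dataset` considered duplicates."""

    def remove(self, dataset: pd.DataFrame, duplicates: pd.DataFrame) -> pd.DataFrame:
        # Left-anti join: keep rows whose id is not among the duplicates.
        dup_ids = set(duplicates[self.id_field])
        return dataset[~dataset[self.id_field].isin(dup_ids)]

    def __call__(self, dataset: pd.DataFrame, perform_removal: bool = False) -> pd.DataFrame:
        duplicates = self.identify(dataset)
        if perform_removal:
            return self.remove(dataset, duplicates)
        return duplicates

class ExactDuplicates(Deduplicator):
    """Toy exact dedup: flags rows whose text appeared earlier in the frame."""
    def identify(self, dataset: pd.DataFrame) -> pd.DataFrame:
        return dataset[dataset.duplicated(subset="text", keep="first")]

df = pd.DataFrame({"id": [1, 2, 3], "text": ["hello", "hello", "world"]})
dedup = ExactDuplicates()
duplicates = dedup(df)                     # identify only
deduped = dedup(df, perform_removal=True)  # identify, then remove
```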

Usage

dataset = DocumentDataset.read_parquet(...)
exact_dedup = ExactDeduplicator(...)
duplicates = exact_dedup(dataset)  # or exact_dedup.identify(dataset)

exact_dedup.remove(dataset)

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <[email protected]>
@praateekmahajan praateekmahajan changed the title [WIP] Removal logic for Exact / Fuzzy Dedup Removal logic for Exact / Fuzzy Dedup Jan 30, 2025
Signed-off-by: Praateek <[email protected]>
@praateekmahajan praateekmahajan added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Jan 30, 2025
Collaborator

@ryantwolf ryantwolf left a comment


A couple of comments about the organization. Can you make sure you update the API docs and the deduplication sections in the user guide to mention all this?

One downside of this approach: when the right side is larger than host memory per node, the merge operation needs to spill but cannot, so Dask pauses workers once they reach 80% memory capacity and the job gets stuck. In that case, the user has to implement different merge logic.

Make sure to mention this in the user guide and describe roughly when "right is bigger than system memory per node" in layman's terms. Is this something we have/will encounter? Wondering if globally deduping 240 TB would trigger this.

Copy link
Collaborator


Can this be moved under nemo_curator/modules/deduplicator.py? I think the function is something we want users to be able to access.

return removed_result


class Deduplicator(ABC):
Collaborator


Not sure how I feel about the abstraction. I have been wanting something like this, but I worry it is not as generalizable as I'd want it to be. For example, can semantic dedup use this? I don't believe it can, since its duplicates aren't all grouped like this. Imo, if the deduplication abstraction doesn't work for all our deduplication methods, I'd rather not have it, so we don't confuse our users. We can always refactor the logic out into a base class if we find a general solution.

Collaborator


Maybe call it

Suggested change
-class Deduplicator(ABC):
+class DuplicateRemover:

or DuplicatesRemover instead?

Collaborator


Possible example usage:

remover = DuplicatesRemover(...)

exact_dupes = ExactDuplicates(...).identify_duplicates(...)
deduped_data = remover.remove_duplicates(exact_dupes)

fuzzy_dupes = FuzzyDuplicates(...).identify_duplicates(...)
deduped_data = remover.remove_duplicates(fuzzy_dupes)

# Could it be possible to call both simultaneously?
# deduped_data = remover.remove_duplicates(exact_dupes, fuzzy_dupes)
# deduped_data = remover.remove_duplicates([exact_dupes, fuzzy_dupes])
# or similar...

?

Should implement the logic for identifying duplicates in the dataset."""
raise NotImplementedError

def remove(
Collaborator


I think this could just be a helper function, defined here, that exact and fuzzy dedup import and use, instead of the ABC.
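The helper-function alternative being suggested might look something like this (a sketch; the function name, `id` column, and pandas stand-in are all my assumptions):

```python
import pandas as pd

def remove_duplicates(dataset: pd.DataFrame, duplicates: pd.DataFrame,
                      id_field: str = "id") -> pd.DataFrame:
    """Free function both ExactDuplicates and FuzzyDuplicates could import,
    rather than inheriting it from an abstract base class."""
    return dataset[~dataset[id_field].isin(duplicates[id_field])]

docs = pd.DataFrame({"id": [1, 2, 3], "text": ["x", "y", "z"]})
flagged = pd.DataFrame({"id": [2]})
cleaned = remove_duplicates(docs, flagged)
```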


from nemo_curator.datasets import DocumentDataset
from nemo_curator.log import create_logger
from nemo_curator.utils.distributed_utils import performance_report_if_with_ts_suffix
from nemo_curator.utils.gpu_utils import is_cudf_type


-class ExactDuplicates:
+class ExactDuplicates(Deduplicator):
Collaborator


Suggested change
-class ExactDuplicates(Deduplicator):
+class ExactDuplicates:

?

@@ -35,7 +34,7 @@
from nemo_curator.utils.distributed_utils import performance_report_if_with_ts_suffix


-class FuzzyDuplicates:
+class FuzzyDuplicates(Deduplicator):
Collaborator


Suggested change
-class FuzzyDuplicates(Deduplicator):
+class FuzzyDuplicates:

?


@praateekmahajan praateekmahajan marked this pull request as draft January 31, 2025 23:01
@praateekmahajan
Collaborator Author

@sarahyurick / @ryantwolf, given that both of you had thoughts about it not being an abstract class, I took your suggestions and started #509
@ryantwolf I'll add the docs once we are clear on what the API is, so that I don't have to change them repeatedly.

Regarding your question about whether we've run into a case where the duplicates are larger than system (host) memory: yes.
In one of our experiments, a 64 TB dataset produced ~300 GB of duplicates, while our host memory was 250 GB per worker. That didn't work out for us: cuDF spilled to host memory during the merge and ran out of memory.
In that case, we'd likely recommend the user implement something custom at the partition level (where each partition is typically a year of CC or something similar).

It's still an open question why Dask / cuDF needed to hold the whole ~300 GB in memory, but for now this won't work at that scale. It works whenever the duplicates' size is much smaller than host memory, which varies across systems.
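The partition-level workaround mentioned above could be sketched like this (pandas stand-in for illustration; a real pipeline would iterate Dask/cuDF partitions, and the `id` column is an assumption):

```python
import pandas as pd
from typing import Iterable, Iterator

def remove_per_partition(partitions: Iterable[pd.DataFrame],
                         duplicate_ids: Iterable) -> Iterator[pd.DataFrame]:
    """Anti-join one partition at a time against the duplicate-id set,
    so only a single partition (plus the id set) must fit in memory."""
    dup_ids = set(duplicate_ids)
    for part in partitions:
        yield part[~part["id"].isin(dup_ids)]

# Toy example: two partitions, two flagged duplicate ids.
parts = [pd.DataFrame({"id": [1, 2]}), pd.DataFrame({"id": [3, 4]})]
cleaned = list(remove_per_partition(parts, duplicate_ids=[2, 3]))
```

This trades the single large merge for many small ones, which sidesteps the spill-and-pause failure mode at the cost of materializing the duplicate-id set on every worker.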

@sarahyurick sarahyurick closed this Feb 3, 2025