Removal logic for fuzzy / exact (no class abstraction) #509

praateekmahajan · 2025-01-31T22:57:59Z

Description

See #529 for much more detailed analysis on the decision and followups

Removal Description

We implement an approximate left-anti join using a broadcast merge, where right is the set of identified duplicates. This allows us to scale even when right is larger than device memory per node.
In our exact and fuzzy deduplication, the last step of identification is a shuffle at write time. When reading for removal, we read with FPP=1 (one file per partition), so that we read one group at a time and do not incur "another" shuffle. By "another" shuffle I mean that a broadcast join already reshuffles right on hash(id_col), so forcing a shuffle at read time would result in a double shuffle.
We observe that performance varies with the number of partitions (or partition sizes).
- The fastest case is when right is exactly one partition: we broadcast all of right to every worker and perform a merge, which results in the least data transfer.
One downside of this approach appears when right is bigger than system memory per node: the merge would have to spill, which is not possible, so the job gets stuck because Dask pauses workers once they reach 80% of memory capacity. In that case the user will have to implement a different merge logic.
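
To illustrate the idea described above, here is a minimal pure-Python sketch of a left-anti join done by "broadcasting" the right side to every left partition. It is only a toy model under stated assumptions: the names `left_anti_join` and `broadcast_remove`, the list-of-dicts partitions, and the `id` column are hypothetical and are not the NeMo Curator or Dask implementation.

```python
from typing import Dict, List, Set

Row = Dict[str, object]


def left_anti_join(partition: List[Row], duplicate_ids: Set[object], id_col: str = "id") -> List[Row]:
    """Keep only the rows whose id does NOT appear in duplicate_ids."""
    return [row for row in partition if row[id_col] not in duplicate_ids]


def broadcast_remove(
    left_partitions: List[List[Row]],
    right_partitions: List[List[Row]],
    id_col: str = "id",
) -> List[List[Row]]:
    """Toy broadcast merge: every left partition is filtered against every right partition."""
    result = []
    for part in left_partitions:
        for right in right_partitions:
            # Each right partition is small enough to be handed to the worker in full.
            ids = {r[id_col] for r in right}
            part = left_anti_join(part, ids, id_col)
        result.append(part)
    return result


left = [[{"id": 1}, {"id": 2}], [{"id": 3}, {"id": 4}]]
right = [[{"id": 2}], [{"id": 4}]]
deduplicated = broadcast_remove(left, right)
```

With a single right partition the inner loop runs once per left partition, which mirrors the "fastest" case mentioned above.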

Class Description

Adds a remove method to Fuzzy/ExactDeduplicator that calls nemo_curator.modules.removal.remove_duplicates.
Moves logic from __call__ to identify_duplicates.
Adds another parameter to class construction, perform_removal, which defaults to False to retain the old behavior. When it is True, __call__ removes the duplicates as well.
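
The control flow described above can be sketched with a toy class. This is an assumption-laden illustration, not the actual ExactDeduplicator: `ToyExactDeduplicator`, the list-of-dicts documents, and the `id` / `text` keys are hypothetical; only the method names `identify_duplicates`, `remove`, `__call__` and the `perform_removal` flag come from this PR.

```python
import hashlib
from typing import Dict, List, Set, Union

Doc = Dict[str, object]


class ToyExactDeduplicator:
    """Hypothetical stand-in showing how perform_removal changes __call__."""

    def __init__(self, perform_removal: bool = False):
        # False keeps the old behavior: __call__ only identifies duplicates.
        self.perform_removal = perform_removal

    def identify_duplicates(self, docs: List[Doc]) -> Set[object]:
        """Return ids of documents whose text hash was already seen."""
        seen: Set[str] = set()
        duplicates: Set[object] = set()
        for doc in docs:
            digest = hashlib.md5(str(doc["text"]).encode()).hexdigest()
            if digest in seen:
                duplicates.add(doc["id"])
            else:
                seen.add(digest)
        return duplicates

    def remove(self, docs: List[Doc], duplicates: Set[object]) -> List[Doc]:
        """Drop every document whose id is in duplicates."""
        return [doc for doc in docs if doc["id"] not in duplicates]

    def __call__(self, docs: List[Doc]) -> Union[Set[object], List[Doc]]:
        duplicates = self.identify_duplicates(docs)
        if self.perform_removal:
            return self.remove(docs, duplicates)
        return duplicates


docs = [{"id": 0, "text": "a"}, {"id": 1, "text": "b"}, {"id": 2, "text": "a"}]
only_identified = ToyExactDeduplicator()(docs)
removed = ToyExactDeduplicator(perform_removal=True)(docs)
```

Keeping `identify_duplicates` and `remove` public lets a caller run the two steps separately when a single `__call__` is not practical.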

Usage

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Praateek <[email protected]>

ryantwolf

Looks good overall. One final change, though: can you modify nemo_curator/modules/__init__.py to include nemo_curator.modules.removal.remove_duplicates in __all__? Model it after blend_datasets if you need a reference.

sarahyurick · 2025-02-03T17:39:49Z

nemo_curator/modules/exact_dedup.py

@@ -135,7 +136,7 @@ def hash_documents(
            # TODO: Generalize ty using self.hash_method
            return df.apply(lambda x: md5(x.encode()).hexdigest())

-    def __call__(self, dataset: DocumentDataset) -> Union[DocumentDataset, str]:
+    def identify(self, dataset: DocumentDataset) -> DocumentDataset:


Suggested change

-    def identify(self, dataset: DocumentDataset) -> DocumentDataset:
+    def _identify(self, dataset: DocumentDataset) -> DocumentDataset:

Nit, but maybe call them _identify and _remove if they are not intended to be accessed by the user directly.

Let's keep them exposed especially since remove won't work at scales where size of duplicate >> host memory, in which case the user will need to break down identify and remove

Yes, that makes sense to me. What about calling it identify_duplicates?

Sounds good, initially I thought it's slightly verbose, but another argument in favor of identify_duplicates would be that in future we might want to expose identify_documents_to_keep in which the distinction might be necessary

cc @ayushdg / @ryantwolf / @VibhuJawa

nemo_curator/modules/exact_dedup.py

nemo_curator/modules/removal.py

Signed-off-by: Praateek <[email protected]>

ryantwolf

Looks good overall, just a few comments.

nemo_curator/modules/exact_dedup.py

nemo_curator/modules/fuzzy_dedup/fuzzyduplicates.py

nemo_curator/modules/semantic_dedup/clusteringmodel.py

sarahyurick · 2025-02-04T19:41:09Z

docs/user-guide/gpudeduplication.rst

@@ -82,9 +82,11 @@ After ensuring your dataset has a unique ID field (or creating one with the code
    107  doc_prefix-52271  0f763a2937d57b9d96bf9f220e55f2bd
    """

+    deduplicated_dataset = exact_duplicates.remove(dataset, duplicate_docs)


Should also include the perform_removal option above?

sarahyurick · 2025-02-04T19:42:08Z

nemo_curator/modules/exact_dedup.py

@@ -135,7 +136,7 @@ def hash_documents(
            # TODO: Generalize ty using self.hash_method
            return df.apply(lambda x: md5(x.encode()).hexdigest())

-    def __call__(self, dataset: DocumentDataset) -> Union[DocumentDataset, str]:
+    def identify(self, dataset: DocumentDataset) -> DocumentDataset:


Yes, that makes sense to me. What about calling it identify_duplicates?

sarahyurick · 2025-02-04T19:43:49Z

nemo_curator/utils/removal.py

Maybe call it "duplicates_removal" or something similar?

VibhuJawa

Thanks for pushing through this. This mostly looks good to me.

The only ask is not to modify the __call__ header and behavior in this release. Everything else looks great.

nemo_curator/modules/exact_dedup.py

Signed-off-by: Praateek <[email protected]>

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Praateek Mahajan <[email protected]>

praateekmahajan · 2025-02-06T19:41:38Z

nemo_curator/modules/fuzzy_dedup/connectedcomponents.py

-            # Ensure all docs in the same group are in the same partition
-            labels_df = labels_df.shuffle(on=["group"], ignore_index=True)


@ayushdg we're doing this here
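
The removed lines above shuffle labels_df on the group column so that all documents in the same group end up in the same partition. A toy pure-Python sketch of that invariant follows; it is not Dask code, and `shuffle_on_group` and the row layout are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, int]


def shuffle_on_group(rows: List[Row], npartitions: int) -> List[List[Row]]:
    """Assign each row to a partition chosen by hashing its group value."""
    buckets = defaultdict(list)
    for row in rows:
        # Rows with equal group values always hash to the same bucket.
        buckets[hash(row["group"]) % npartitions].append(row)
    return [buckets[i] for i in range(npartitions)]


rows = [
    {"id": 0, "group": 7},
    {"id": 1, "group": 3},
    {"id": 2, "group": 7},
    {"id": 3, "group": 3},
]
partitions = shuffle_on_group(rows, npartitions=2)
```

After such a shuffle, a reader that processes one partition at a time sees each group whole, which is the property the removal step relies on when it reads with FPP=1.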

ayushdg

Left some initial comments. Would also be interested in reviewing the remove_duplicates utility once that's uploaded

nemo_curator/modules/exact_dedup.py

Signed-off-by: Praateek <[email protected]>

VibhuJawa

LGTM (Assuming above reviews get through)

Signed-off-by: Praateek <[email protected]>

docs/user-guide/gpudeduplication.rst

nemo_curator/modules/exact_dedup.py

nemo_curator/modules/fuzzy_dedup/fuzzyduplicates.py

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Praateek Mahajan <[email protected]>

sarahyurick · 2025-02-07T22:08:35Z

Actually, I will cancel gpuCI since the last run looks good and only docs were updated.

ayushdg

Thanks lgtm!

Once we update the defaults to include removal, tutorials and CI scripts will also need updates.


praateekmahajan added 7 commits January 28, 2025 12:37

fc

89b9005

Signed-off-by: Praateek <[email protected]>

add shuffle/ tests

37f6bee

Signed-off-by: Praateek <[email protected]>

more test for class

69c8955

Signed-off-by: Praateek <[email protected]>

pre-commit

de25476

Signed-off-by: Praateek <[email protected]>

remove class abstractions

a698bf0

Signed-off-by: Praateek <[email protected]>

remove unused import

a2e0c42

Signed-off-by: Praateek <[email protected]>

add __call__ methods back

845cae3

Signed-off-by: Praateek <[email protected]>

praateekmahajan requested review from sarahyurick and ryantwolf January 31, 2025 23:01

ryantwolf reviewed Jan 31, 2025


praateekmahajan mentioned this pull request Jan 31, 2025

Removal logic for Exact / Fuzzy Dedup #499

Closed

3 tasks

sarahyurick added the gpuci Run GPU CI/CD on PR label Feb 3, 2025

sarahyurick reviewed Feb 3, 2025


praateekmahajan added 2 commits February 3, 2025 15:02

change from modules / update docs

2a1da6b

Signed-off-by: Praateek <[email protected]>

add tests

48bef03

Signed-off-by: Praateek <[email protected]>

praateekmahajan requested review from ryantwolf and sarahyurick February 4, 2025 00:39

praateekmahajan added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Feb 4, 2025

praateekmahajan requested review from ayushdg and VibhuJawa February 4, 2025 00:48

update blocksize to 1024 in exact

958161d

Signed-off-by: Praateek <[email protected]>

ryantwolf requested changes Feb 4, 2025

nemo_curator/modules/exact_dedup.py (outdated)

nemo_curator/modules/fuzzy_dedup/fuzzyduplicates.py

nemo_curator/modules/semantic_dedup/clusteringmodel.py (outdated)

sarahyurick reviewed Feb 4, 2025


VibhuJawa requested changes Feb 4, 2025

nemo_curator/modules/exact_dedup.py (outdated)

pr suggestions

7275609

Signed-off-by: Praateek <[email protected]>

praateekmahajan added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Feb 5, 2025

praateekmahajan requested review from VibhuJawa and ryantwolf February 5, 2025 22:12

Update nemo_curator/modules/exact_dedup.py

e41c5fa

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Praateek Mahajan <[email protected]>

praateekmahajan commented Feb 6, 2025


ayushdg reviewed Feb 6, 2025

nemo_curator/modules/exact_dedup.py

nemo_curator/modules/exact_dedup.py (outdated)

nemo_curator/modules/exact_dedup.py (outdated)

add file back

9c7f4bf

Signed-off-by: Praateek <[email protected]>

VibhuJawa approved these changes Feb 6, 2025


praateekmahajan added 5 commits February 6, 2025 14:36

merge

fe6f018

Signed-off-by: Praateek <[email protected]>

pre-commit

7f0da3e

Signed-off-by: Praateek <[email protected]>

forgot to rename back to identify_duplicates after merge

b438c80

Signed-off-by: Praateek <[email protected]>

renmaed func in call

f8040b5

Signed-off-by: Praateek <[email protected]>

split code / read fpp=1

82f0c6c

Signed-off-by: Praateek <[email protected]>

praateekmahajan mentioned this pull request Feb 7, 2025

Exact / Fuzzy Duplicate Removal Improvements at Scale #529

Open

praateekmahajan added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Feb 7, 2025

praateekmahajan requested review from ayushdg and sarahyurick February 7, 2025 19:40

ryantwolf approved these changes Feb 7, 2025


sarahyurick approved these changes Feb 7, 2025

docs/user-guide/gpudeduplication.rst (outdated)

nemo_curator/modules/exact_dedup.py (outdated)

nemo_curator/modules/fuzzy_dedup/fuzzyduplicates.py (outdated)

praateekmahajan and others added 3 commits February 7, 2025 12:35

Update docs/user-guide/gpudeduplication.rst

bf5498f

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Praateek Mahajan <[email protected]>

Update nemo_curator/modules/fuzzy_dedup/fuzzyduplicates.py

f172c72

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Praateek Mahajan <[email protected]>

Update nemo_curator/modules/exact_dedup.py

2beca67

Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Praateek Mahajan <[email protected]>

sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Feb 7, 2025

ayushdg approved these changes Feb 7, 2025


Merge branch 'main' into praateek/removal-code-no-abstraction

f8d89da

praateekmahajan merged commit f642628 into NVIDIA:main Feb 8, 2025
4 checks passed

philm001 pushed a commit to philm001/NeMo-Curator that referenced this pull request Feb 10, 2025

Removal logic for fuzzy / exact (no class abstraction) (NVIDIA#509)

dfe010d

Signed-off-by: Phillip Mobley <[email protected]>

philm001 pushed a commit to philm001/NeMo-Curator that referenced this pull request Feb 10, 2025

Removal logic for fuzzy / exact (no class abstraction) (NVIDIA#509)

f097d95

Signed-off-by: Phillip Mobley <[email protected]>

philm001 pushed a commit to philm001/NeMo-Curator that referenced this pull request Feb 10, 2025

Removal logic for fuzzy / exact (no class abstraction) (NVIDIA#509)

39db13d

Signed-off-by: Phillip Mobley <[email protected]>
