Removal logic for fuzzy / exact (no class abstraction) #509

Merged: praateekmahajan merged 33 commits into NVIDIA:main from praateekmahajan:praateek/removal-code-no-abstraction on Feb 8, 2025.
Commits (33 total, all by praateekmahajan; this view shows changes from 7 of them):

- 89b9005 fc
- 37f6bee add shuffle/ tests
- 69c8955 more test for class
- de25476 pre-commit
- a698bf0 remove class abstractions
- a2e0c42 remove unused import
- 845cae3 add __call__ methods back
- 2a1da6b change from modules / update docs
- 48bef03 add tests
- 958161d update blocksize to 1024 in exact
- 7275609 pr suggestions
- cba7fcd warning
- bcb7cea Update docs/user-guide/gpudeduplication.rst
- c929927 Update docs/user-guide/gpudeduplication.rst
- 0afd1a1 Update docs/user-guide/gpudeduplication.rst
- 6f1e4d9 Update examples/exact_deduplication.py
- 1347e37 Update examples/exact_deduplication.py
- 2e3c908 Update examples/fuzzy_deduplication.py
- bc20a5d Update examples/fuzzy_deduplication.py
- 6e26edb Update examples/fuzzy_deduplication.py
- 8ba196a Update nemo_curator/modules/config.py
- 8936ac9 Update nemo_curator/modules/config.py
- e41c5fa Update nemo_curator/modules/exact_dedup.py
- 9c7f4bf add file back
- fe6f018 merge
- 7f0da3e pre-commit
- b438c80 forgot to rename back to identify_duplicates after merge
- f8040b5 renmaed func in call
- 82f0c6c split code / read fpp=1
- bf5498f Update docs/user-guide/gpudeduplication.rst
- f172c72 Update nemo_curator/modules/fuzzy_dedup/fuzzyduplicates.py
- 2beca67 Update nemo_curator/modules/exact_dedup.py
- f8d89da Merge branch 'main' into praateek/removal-code-no-abstraction
Changed hunk in `_run_connected_components` (the hunk header `-125,8 +125,6` indicates two lines removed; per the review comment below, those are the group shuffle and its comment):

```diff
@@ -125,8 +125,6 @@ def _run_connected_components(
             f"# rows in labels_df = {len(labels_df)}"
         )
         assert num_nodes == len(labels_df)
-        # Ensure all docs in the same group are in the same partition
-        labels_df = labels_df.shuffle(on=["group"], ignore_index=True)
         labels_df.to_parquet(output_path, write_index=False, overwrite=True)
         Comms.destroy()
         self._logger.info(
```

Review comment: "@ayushdg we're doing this here"
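The removed shuffle matters because `duplicated(keep="first")` is applied per partition: it only keeps exactly one row per group when all rows of a group are colocated in the same partition. A minimal pandas sketch (hypothetical toy data, simulating two Dask partitions as two DataFrames) of the failure mode without colocating:

```python
import pandas as pd

# Two simulated partitions; group "a" is split across them.
part1 = pd.DataFrame({"id": ["a0", "b0"], "group": ["a", "b"]})
part2 = pd.DataFrame({"id": ["a1", "a2"], "group": ["a", "a"]})

# Per-partition duplicated(keep="first") WITHOUT colocating groups:
# each partition independently keeps its own "first" row of group "a".
kept_split = [p[~p["group"].duplicated(keep="first")] for p in (part1, part2)]
n_kept_split = sum(len(k) for k in kept_split)  # 3 rows survive: a0, b0, a1

# After a shuffle that colocates each group in one partition,
# only one row per group survives, as intended.
whole = pd.concat([part1, part2], ignore_index=True)
kept_colocated = whole[~whole["group"].duplicated(keep="first")]  # a0, b0

print(n_kept_split, len(kept_colocated))  # → 3 2
```

This is why the shuffle on the group column has to happen somewhere before the partition-local dedup step.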
New file, `nemo_curator/modules/removal.py` (42 lines; the module path is taken from the import in the tests below):

```python
import dask.dataframe as dd


def remove_duplicates(
    left: dd.DataFrame,
    duplicates: dd.DataFrame,
    id_field: str,
    group_field: str,
) -> dd.DataFrame:
    if left.npartitions < duplicates.npartitions:
        msg = (
            "The number of partitions in `left` is less than the number of partitions in the duplicates dataset. "
            "This may lead to a shuffle join. Please re-read left and right with different partition sizes, or repartition left / right."
        )
        raise ValueError(msg)

    # Create a new column name for temporary ID storage during merge
    new_id_field = f"{id_field}_new"

    duplicates_to_remove = (
        duplicates
        # Redistribute data across partitions so that all rows of a group are in the same partition
        .shuffle(on=[group_field], ignore_index=True)
        # For each partition, keep only the duplicated rows (excluding the first occurrence)
        .map_partitions(lambda x: x[x[group_field].duplicated(keep="first")]).drop(
            columns=group_field
        )
        # Rename the ID field to avoid conflicts in the upcoming merge
        .rename(columns={id_field: new_id_field})[[new_id_field]]
    )

    merge = left.merge(
        right=duplicates_to_remove,
        how="left",
        broadcast=True,  # Broadcast smaller DataFrame to all partitions
        left_on=id_field,
        right_on=new_id_field,
    )

    # Keep only the rows that did NOT match duplicates_to_remove,
    # i.e. drop every duplicate except the first occurrence per group
    removed_result = merge[merge[new_id_field].isna()].drop(columns=[new_id_field])
    return removed_result
```
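The keep-first semantics of the helper can be checked without a Dask cluster. A pandas-only sketch (hypothetical toy data; `remove_duplicates_pandas` is an illustrative stand-in, not part of the PR) of the same logic:

```python
import pandas as pd


def remove_duplicates_pandas(left, duplicates, id_field, group_field):
    # IDs of every duplicate except the first occurrence in each group
    to_remove = duplicates.loc[
        duplicates[group_field].duplicated(keep="first"), id_field
    ]
    # Keep only rows of `left` whose id is not marked for removal
    return left[~left[id_field].isin(to_remove)]


left = pd.DataFrame({"id": ["a0", "a1", "b0", "b1"], "text": ["w", "x", "y", "z"]})
dups = pd.DataFrame({"id": ["a0", "a1", "b0", "b1"], "group": ["a", "a", "b", "b"]})
result = remove_duplicates_pandas(left, dups, "id", "group")
print(sorted(result["id"]))  # → ['a0', 'b0'], one survivor per group
```

The Dask version does the same thing, but shuffles on the group column first so that `duplicated(keep="first")` can run per partition.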
New test file (125 lines):

```python
import random

import pandas as pd
import pytest
from dask import dataframe as dd

from nemo_curator.modules.removal import remove_duplicates


@pytest.fixture()
def ids():
    # Dataset has ids a0...a9, b0...b9, c0...c9, d0...d9
    l = [f"{group}{i}" for group in ["a", "b", "c", "d"] for i in range(10)]
    # Shuffle to make sure all duplicates are not in the same partition
    random.shuffle(l)
    return l


@pytest.fixture
def sample_data(ids):
    df = pd.DataFrame(
        {
            "id": ids,
            "text": [f"text for {_id}" for _id in ids],
        }
    )
    return dd.from_pandas(df, npartitions=4)


@pytest.fixture
def duplicate_data(ids):
    # In each group we want to keep only the first occurrence (e.g. a1, b1, c1, d1)
    df = pd.DataFrame([{"id": _id, "group": _id[0]} for _id in ids])
    # Shuffle to make sure all duplicates are not in the same partition
    return dd.from_pandas(df, npartitions=2)


def test_remove_duplicates_basic(
    sample_data: dd.DataFrame, duplicate_data: dd.DataFrame
):
    # Test basic duplicate removal functionality
    result = remove_duplicates(
        left=sample_data, duplicates=duplicate_data, id_field="id", group_field="group"
    )

    result = result.compute()

    assert list(result.columns) == ["id", "text"]
    assert len(result) == 4
    # It's not guaranteed that we'll have a0, b0, c0, d0 in the result,
    # so we check the first character instead
    assert set(result["id"].apply(lambda x: x[0]).tolist()) == set(["a", "b", "c", "d"])


def test_remove_duplicates_all_duplicates(ids: list[str], sample_data: dd.DataFrame):
    duplicates = dd.from_pandas(
        pd.DataFrame({"id": ids, "group": [1] * len(ids)}), npartitions=2
    )

    result = remove_duplicates(
        left=sample_data, duplicates=duplicates, id_field="id", group_field="group"
    )

    result = result.compute()
    assert list(result.columns) == ["id", "text"]
    # Should keep only one of the occurrences
    assert len(result) == 1


def test_not_remove_duplicates_unique(ids: list[str], sample_data: dd.DataFrame):
    # We create a duplicates dataset where the first 30 ids are in one group,
    # the next 9 ids are in distinct groups,
    # and the last id is not mentioned in duplicates

    duplicates = dd.from_pandas(
        pd.DataFrame(
            {
                "id": ids[:30] + ids[30:39],
                "group": ["group0"] * 30 + [f"group{i}" for i in range(1, 10)],
            }
        ),
        npartitions=2,
    )
    result = remove_duplicates(
        left=sample_data, duplicates=duplicates, id_field="id", group_field="group"
    )

    result = result.compute()
    assert list(result.columns) == ["id", "text"]
    # 1 row survives from the first group of 30,
    # 9 rows from the 9 distinct groups,
    # and 1 row for the last id, which is not in the set of duplicates
    assert len(result) == 1 + 9 + 1
    # The last 10 ids should be in the result, plus one more from the first 30
    assert set(ids[30:]).issubset(set(result["id"].tolist()))


def test_remove_duplicates_raise_error():
    # Create sample dataframes with specific partition counts
    df1 = dd.from_pandas(
        pd.DataFrame({"id": ["a1", "a2", "a3"], "text": ["text1", "text2", "text3"]}),
        npartitions=2,
    )  # dataset with 2 partitions

    duplicates = dd.from_pandas(
        pd.DataFrame(
            {"id": ["a1", "a2", "a3"], "group": ["group1", "group1", "group1"]}
        ),
        npartitions=3,
    )  # duplicates dataset with 3 partitions

    # It should raise a ValueError when duplicates has more partitions than left
    with pytest.raises(ValueError) as exc_info:
        remove_duplicates(
            left=df1,
            duplicates=duplicates,
            id_field="id",
            group_field="group",
        )

    expected_msg = (
        "The number of partitions in `left` is less than the number of partitions in the duplicates dataset. "
        "This may lead to a shuffle join. Please re-read left and right with different partition sizes, or repartition left / right."
    )
    assert str(exc_info.value) == expected_msg
```
Review thread on the naming of the identify / remove functions:

> Reviewer: Nit, but maybe call them `_identify` and `_remove` if they are not intended to be accessed by the user directly.

> Author: Let's keep them exposed, especially since `remove` won't work at scales where the size of duplicates >> host memory, in which case the user will need to break down identify and remove.

> Reviewer: Yes, that makes sense to me. What about calling it `identify_duplicates`?

> Author: Sounds good. Initially I thought it's slightly verbose, but another argument in favor of `identify_duplicates` would be that in the future we might want to expose `identify_documents_to_keep`, in which case the distinction might be necessary. cc @ayushdg / @ryantwolf / @VibhuJawa