Removal logic for fuzzy / exact (no class abstraction) #509
Signed-off-by: Praateek <[email protected]>
Looks good overall. One final change though: can you modify nemo_curator/modules/__init__.py to include nemo_curator.modules.removal.remove_duplicates in the __all__? Model it after blend_datasets if you need a reference.
nemo_curator/modules/exact_dedup.py
@@ -135,7 +136,7 @@ def hash_documents(
         # TODO: Generalize ty using self.hash_method
         return df.apply(lambda x: md5(x.encode()).hexdigest())

-    def __call__(self, dataset: DocumentDataset) -> Union[DocumentDataset, str]:
+    def identify(self, dataset: DocumentDataset) -> DocumentDataset:
-    def identify(self, dataset: DocumentDataset) -> DocumentDataset:
+    def _identify(self, dataset: DocumentDataset) -> DocumentDataset:
Nit, but maybe call them _identify and _remove if they are not intended to be accessed by the user directly.
Let's keep them exposed, especially since remove won't work at scales where the size of the duplicates >> host memory, in which case the user will need to run identify and remove as separate steps.
Yes, that makes sense to me. What about calling it identify_duplicates?
Sounds good. Initially I thought it was slightly verbose, but another argument in favor of identify_duplicates is that in the future we might want to expose identify_documents_to_keep, in which case the distinction might be necessary.
cc @ayushdg / @ryantwolf / @VibhuJawa
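To illustrate why the thread wants identify and remove exposed separately, here is a minimal pure-Python sketch. It is not the NeMo Curator implementation: the dict-based dataset, the chunked helper, and the function bodies are assumptions made for illustration only. The point is that once identification returns a list of duplicate IDs, removal can be applied one chunk of IDs at a time instead of holding the whole list in memory.

```python
from hashlib import md5
from itertools import islice


def identify_duplicates(docs):
    """Return IDs of documents whose MD5 text hash was already seen (toy version)."""
    seen_hashes = set()
    duplicate_ids = []
    for doc_id, text in docs.items():
        digest = md5(text.encode()).hexdigest()
        if digest in seen_hashes:
            duplicate_ids.append(doc_id)
        else:
            seen_hashes.add(digest)
    return duplicate_ids


def remove(docs, duplicate_ids):
    """Drop every document whose ID appears in duplicate_ids."""
    to_drop = set(duplicate_ids)
    return {doc_id: text for doc_id, text in docs.items() if doc_id not in to_drop}


def chunked(items, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


docs = {
    "doc-0": "hello world",
    "doc-1": "hello world",
    "doc-2": "something else",
    "doc-3": "something else",
}

# Step 1: identify once. In practice this result could be written to disk.
duplicate_ids = identify_duplicates(docs)

# Step 2: remove in small chunks, so only one chunk of IDs is needed at a time.
remaining = docs
for chunk in chunked(duplicate_ids, size=1):
    remaining = remove(remaining, chunk)
```

Keeping the two steps as separate public methods is what makes this kind of user-driven batching possible.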
Looks good overall, just a few comments.
@@ -82,9 +82,11 @@ After ensuring your dataset has a unique ID field (or creating one with the code
 107 doc_prefix-52271 0f763a2937d57b9d96bf9f220e55f2bd
 """

+deduplicated_dataset = exact_duplicates.remove(dataset, duplicate_docs)
Should also include the perform_removal option above?
nemo_curator/utils/removal.py
Maybe call it "duplicates_removal" or something similar?
Thanks for pushing through this. This mostly looks good to me.
The only ask is not to modify the __call__ header and behavior in this release. Everything else looks great.
Co-authored-by: Sarah Yurick <[email protected]> Signed-off-by: Praateek Mahajan <[email protected]>
# Ensure all docs in the same group are in the same partition
labels_df = labels_df.shuffle(on=["group"], ignore_index=True)
@ayushdg we're doing this here
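The shuffle on group matters because any per-partition pass can only spot duplicates that sit in the same partition. The following is a rough pure-Python sketch of that idea, not the Dask code from the PR: shuffle_on_group, duplicates_in_partition, the crc32 hashing, and the "keep the first document per group" rule are all assumptions made for illustration.

```python
from zlib import crc32


def shuffle_on_group(rows, npartitions):
    """Assign each row to a partition by hashing its group, so a group never spans partitions."""
    partitions = [[] for _ in range(npartitions)]
    for row in rows:
        index = crc32(str(row["group"]).encode()) % npartitions
        partitions[index].append(row)
    return partitions


def duplicates_in_partition(partition):
    """Within one partition, keep the first row of each group and flag the rest."""
    seen_groups = set()
    duplicate_ids = []
    for row in partition:
        if row["group"] in seen_groups:
            duplicate_ids.append(row["id"])
        else:
            seen_groups.add(row["group"])
    return duplicate_ids


rows = [
    {"id": "a", "group": 0},
    {"id": "b", "group": 1},
    {"id": "c", "group": 0},
    {"id": "d", "group": 2},
    {"id": "e", "group": 1},
]

partitions = shuffle_on_group(rows, npartitions=2)

# Each partition can now be processed independently without missing a duplicate.
duplicate_ids = sorted(
    doc_id for part in partitions for doc_id in duplicates_in_partition(part)
)
```

Because rows of the same group always hash to the same partition, the per-partition pass finds the same duplicates a global pass would.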
Left some initial comments. Would also be interested in reviewing the remove_duplicates utility once that's uploaded
LGTM (Assuming above reviews get through)
Actually, I will cancel gpuCI since the last run looks good and only docs were updated.
Thanks lgtm!
Once we update the defaults to include removal, tutorials and CI scripts will need updates as well.
Description
See #529 for a much more detailed analysis of the decision and follow-ups.
Removal Description
hash(id_col)
so forcing a shuffle again will result in a double shuffle at read time.
Class Description
remove method to Fuzzy/ExactDeduplicator that calls nemo_curator.modules.removal.remove_duplicates
__call__ to identify_duplicates.
perform_removal, which by default is False, to retain old behavior. But when True, __call__ removes the duplicates as well.
Usage
# Add snippet demonstrating usage
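As a rough illustration of the behavior described under Class Description, the sketch below uses a toy stand-in class. ToyExactDuplicates and its dict-based dataset are hypothetical and exist only for this example. They are not the NeMo Curator classes, whose real methods operate on DocumentDataset objects. The sketch only mirrors the described contract: __call__ identifies duplicates, and additionally removes them when perform_removal is True.

```python
from hashlib import md5


class ToyExactDuplicates:
    """Hypothetical stand-in for illustration only; not the NeMo Curator class."""

    def __init__(self, perform_removal=False):
        # Default False mirrors the "retain old behavior" default described above.
        self.perform_removal = perform_removal

    def identify_duplicates(self, docs):
        """Return IDs of documents whose MD5 text hash was already seen."""
        seen_hashes = set()
        duplicate_ids = []
        for doc_id, text in docs.items():
            digest = md5(text.encode()).hexdigest()
            if digest in seen_hashes:
                duplicate_ids.append(doc_id)
            seen_hashes.add(digest)
        return duplicate_ids

    def remove(self, docs, duplicate_ids):
        """Return the dataset without the identified duplicates."""
        to_drop = set(duplicate_ids)
        return {doc_id: text for doc_id, text in docs.items() if doc_id not in to_drop}

    def __call__(self, docs):
        duplicate_ids = self.identify_duplicates(docs)
        if self.perform_removal:
            return self.remove(docs, duplicate_ids)
        return duplicate_ids


docs = {"doc-0": "hello world", "doc-1": "hello world", "doc-2": "something else"}

only_identified = ToyExactDuplicates()(docs)
deduplicated = ToyExactDuplicates(perform_removal=True)(docs)
```

With the default flag the call returns only the duplicate IDs. With perform_removal=True it returns the filtered dataset.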
Checklist