Enforce Dataframe Backend Checks (NVIDIA#514)
* Add module and to backend

Signed-off-by: Ryan Wolf <[email protected]>

* Add backend tests

Signed-off-by: Ryan Wolf <[email protected]>

* Fix tests

Signed-off-by: Ryan Wolf <[email protected]>

* Add switch backend tests

Signed-off-by: Ryan Wolf <[email protected]>

* Update modules to use module interface

Signed-off-by: Ryan Wolf <[email protected]>

* Directly invoke module init

Signed-off-by: Ryan Wolf <[email protected]>

* Fix call method

Signed-off-by: Ryan Wolf <[email protected]>

* Fix shuffle call method

Signed-off-by: Ryan Wolf <[email protected]>

* Add docs and more tests

Signed-off-by: Ryan Wolf <[email protected]>

* Fix list formatting in docs

Signed-off-by: Ryan Wolf <[email protected]>

* Address Sarah and Praateek's reviews

Signed-off-by: Ryan Wolf <[email protected]>

* Fix modifier get_backend to backend

Signed-off-by: Ryan Wolf <[email protected]>

* Address Ayush's review

Signed-off-by: Ryan Wolf <[email protected]>

---------

Signed-off-by: Ryan Wolf <[email protected]>
ryantwolf authored and philm001 committed Feb 10, 2025
1 parent e7f064d commit 8a113f7
Showing 3 changed files with 3 additions and 5 deletions.
2 changes: 1 addition & 1 deletion nemo_curator/modules/add_id.py
@@ -27,7 +27,7 @@ class AddId(BaseModule):
     def __init__(
         self, id_field, id_prefix: str = "doc_id", start_index: Optional[int] = None
     ) -> None:
-        super().__init__(input_backend="any")
+        super().__init__(input_backend="pandas")
         self.id_field = id_field
         self.id_prefix = id_prefix
         self.start_index = start_index
2 changes: 1 addition & 1 deletion nemo_curator/modules/exact_dedup.py
@@ -146,7 +146,7 @@ def hash_documents(
         # TODO: Generalize ty using self.hash_method
         return df.apply(lambda x: md5(x.encode()).hexdigest())

-    def identify_duplicates(self, dataset: DocumentDataset) -> DocumentDataset:
+    def call(self, dataset: DocumentDataset) -> Union[DocumentDataset, str]:
         """
         Find document ID's for exact duplicates in a given DocumentDataset
         Parameters
4 changes: 1 addition & 3 deletions nemo_curator/modules/fuzzy_dedup/fuzzyduplicates.py
@@ -131,9 +131,7 @@ def __init__(
             profile_dir=self.config.profile_dir,
         )

-    def identify_duplicates(
-        self, dataset: DocumentDataset
-    ) -> Optional[DocumentDataset]:
+    def call(self, dataset: DocumentDataset):
         """
         Parameters
         ----------
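
The hunks above show two patterns: a module declares its `input_backend` in `super().__init__(...)`, and its processing logic lives in a method named `call`. The following is a minimal sketch of how such a backend check could work. It is not NeMo Curator's actual `BaseModule`. `SketchBaseModule`, `SketchAddId`, `ToyDataset`, the `backend` attribute, and the set of supported backend names are all hypothetical names introduced for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class ToyDataset:
    """Hypothetical stand-in for DocumentDataset; `backend` names the dataframe library."""

    rows: List[str]
    backend: str  # assumed values: "pandas" or "cudf"


class SketchBaseModule(ABC):
    """Hypothetical base class; an assumption, not the library's real interface."""

    SUPPORTED_BACKENDS = ("pandas", "cudf", "any")  # assumed set of names

    def __init__(self, input_backend: str) -> None:
        if input_backend not in self.SUPPORTED_BACKENDS:
            raise ValueError(f"Unknown backend {input_backend!r}")
        self.input_backend = input_backend

    @abstractmethod
    def call(self, dataset: ToyDataset) -> ToyDataset:
        """Subclasses put their processing logic here."""

    def __call__(self, dataset: ToyDataset) -> ToyDataset:
        # Enforce the declared backend before any work is done.
        if self.input_backend != "any" and dataset.backend != self.input_backend:
            raise ValueError(
                f"{type(self).__name__} requires the {self.input_backend!r} backend, "
                f"but the dataset uses {dataset.backend!r}"
            )
        return self.call(dataset)


class SketchAddId(SketchBaseModule):
    """Toy module that declares a pandas-only input, mirroring the AddId hunk."""

    def __init__(self, id_prefix: str = "doc_id") -> None:
        super().__init__(input_backend="pandas")
        self.id_prefix = id_prefix

    def call(self, dataset: ToyDataset) -> ToyDataset:
        # Prefix each row with a sequential ID.
        rows = [f"{self.id_prefix}-{i}: {row}" for i, row in enumerate(dataset.rows)]
        return ToyDataset(rows=rows, backend=dataset.backend)


result = SketchAddId()(ToyDataset(rows=["a", "b"], backend="pandas"))
```

In this sketch, passing a dataset whose `backend` is `"cudf"` to `SketchAddId` raises `ValueError` before `call` runs. Keeping the check in `__call__` means subclasses only implement `call` and cannot skip the check by accident.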