Commit

Update nemo_curator/modules/exact_dedup.py
Co-authored-by: Sarah Yurick <[email protected]>
Signed-off-by: Praateek Mahajan <[email protected]>
praateekmahajan and sarahyurick authored Feb 7, 2025
1 parent f172c72 commit 2beca67
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion nemo_curator/modules/exact_dedup.py
@@ -181,7 +181,7 @@ def identify_duplicates(self, dataset: DocumentDataset) -> DocumentDataset:
         return DocumentDataset.read_parquet(
             write_path,
             backend=backend,
-            # we read with FPP=1 so that groups are read in whole (and don't exist across partitions)
+            # We read with files_per_partition=1 so that groups are read in whole (and do not exist across partitions)
             files_per_partition=1,
             blocksize=None,
         )
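
The reasoning in the updated comment can be illustrated with a small pure-Python model. This is only a sketch, not NeMo Curator or Dask code: the `files` data, the helper names, and the assumption that each duplicate group was written entirely into a single file are all hypothetical. It compares treating each file as its own partition (the idea behind `files_per_partition=1` with `blocksize=None`) against a size-based split that ignores file boundaries.

```python
from collections import defaultdict

# Hypothetical data: each "file" is a list of (doc_id, group_hash) rows.
# Assumption: every duplicate group was written entirely into a single file.
files = [
    [("a", "h1"), ("b", "h1"), ("c", "h2")],
    [("d", "h3"), ("e", "h3")],
    [("f", "h4"), ("g", "h4"), ("h", "h4")],
]


def one_file_per_partition(files):
    """Toy model of reading one file per partition with no block splitting."""
    return [list(rows) for rows in files]


def split_by_row_count(files, rows_per_partition):
    """Toy model of size-based splitting: boundaries ignore file boundaries."""
    rows = [row for file_rows in files for row in file_rows]
    return [
        rows[i:i + rows_per_partition]
        for i in range(0, len(rows), rows_per_partition)
    ]


def groups_spanning_partitions(partitions):
    """Return the group hashes that appear in more than one partition."""
    seen = defaultdict(set)
    for index, rows in enumerate(partitions):
        for _doc_id, group in rows:
            seen[group].add(index)
    return sorted(group for group, parts in seen.items() if len(parts) > 1)


whole = groups_spanning_partitions(one_file_per_partition(files))
split = groups_spanning_partitions(split_by_row_count(files, rows_per_partition=2))
print(whole)  # []
print(split)  # ['h3', 'h4']
```

In this model, one file per partition keeps every group inside a single partition, while the row-count split leaves groups `h3` and `h4` straddling a partition boundary, so a per-partition operation would see only part of each group.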