
Add support for Nemotron-CC quality classifiers #518

Open · wants to merge 10 commits into main
Conversation

@sarahyurick (Collaborator) commented Feb 4, 2025

Awaiting Hugging Face releases.

TODO:

  • tests/test_classifiers.py
  • README.md
  • tutorials/distributed_data_classification/README.md
  • Create notebook tutorial(s)
  • docs/user-guide/api/classifiers.rst
  • docs/user-guide/cpuvsgpu.rst
  • docs/user-guide/distributeddataclassification.rst
  • examples/classifiers/README.md
  • Example script(s)
  • nemo_curator/classifiers/__init__.py
  • Integration script(s)
  • nemo_curator/scripts/classifiers/README.md
  • Classifier script(s)
  • pyproject.toml
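
While those items land, here is a minimal usage sketch in the style of the existing NeMo Curator classifiers (e.g. FineWebEduClassifier). The class name FineWebMixtralEduClassifier and the paths are assumptions based on this PR and may shift once the Hugging Face releases are out:

    from nemo_curator.classifiers import FineWebMixtralEduClassifier  # tentative name from this PR
    from nemo_curator.datasets import DocumentDataset

    # Read a JSONL dataset onto the GPU (cudf backend), as in the existing classifier examples
    dataset = DocumentDataset.read_json("input_data/", backend="cudf")

    # Score each document and write the results back out
    classifier = FineWebMixtralEduClassifier()
    scored = classifier(dataset)
    scored.to_json("output_data/")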

sarahyurick and others added 7 commits February 4, 2025 15:22
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick marked this pull request as ready for review February 7, 2025 19:08
@VibhuJawa (Collaborator) left a comment

Did an initial review; mostly looks good.

Have requested changes around naming, class structure, etc.

Resolved (outdated) review threads:

  • docs/user-guide/api/classifiers.rst (2)
  • docs/user-guide/distributeddataclassification.rst (6)
Comment on lines 52 to 54:

    model = AutoModelForSequenceClassification.from_pretrained(
        self.path_or_name, torch_dtype=torch.bfloat16
    )
@VibhuJawa (Collaborator):

Interesting that we now read FINEWEB_MIXTRAL_IDENTIFIER and FINEWEB_NEMOTRON_IDENTIFIER in bfloat16.

Do you know why we can't/don't do it for the EDU classifier?

Can you add a comment stating the reason for this?
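
For illustration, the asymmetry in question is between these two load paths (a sketch with a placeholder model id, not the PR's actual constants):

    import torch
    from transformers import AutoModelForSequenceClassification

    model_id = "some-org/quality-classifier"  # hypothetical stand-in for FINEWEB_*_IDENTIFIER

    # FineWeb Mixtral/Nemotron path in this PR: weights materialized directly in bfloat16
    model_bf16 = AutoModelForSequenceClassification.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    )

    # EDU classifier path: default load, weights kept in fp32
    model_fp32 = AutoModelForSequenceClassification.from_pretrained(model_id)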

@sarahyurick (Collaborator, Author):

Not sure; this is how it was done in the script that the Nemotron-CC developers used.

@VibhuJawa (Collaborator):

Can we do a quick benchmark with autocast? I think this should happen automatically when we use torch.autocast, which we use (or should use).

If the results on a dataset line up (for both accuracy and throughput), we can probably skip this fork.
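
A minimal sketch of such a check, assuming a CUDA device and the same placeholder model id as above (the real identifiers live behind FINEWEB_MIXTRAL_IDENTIFIER / FINEWEB_NEMOTRON_IDENTIFIER in the PR):

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_id = "some-org/quality-classifier"  # hypothetical stand-in
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    batch = tokenizer(["Example document text."], return_tensors="pt", truncation=True).to("cuda")

    # The fork in question: weights loaded directly in bfloat16
    model_bf16 = AutoModelForSequenceClassification.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    ).to("cuda").eval()

    # Alternative: fp32 weights with bfloat16 compute supplied by autocast
    model_fp32 = AutoModelForSequenceClassification.from_pretrained(model_id).to("cuda").eval()

    with torch.no_grad():
        ref = model_bf16(**batch).logits.float()
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            alt = model_fp32(**batch).logits.float()

    # If the scores (and throughput on a real dataset) line up, the explicit bfloat16 load can go
    print((ref - alt).abs().max().item())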

@sarahyurick (Collaborator, Author):

Just ran it: the results look a bit different without bfloat16, but still pretty similar to the ones from before. I have removed torch_dtype=torch.bfloat16 for now.

Resolved (outdated) review thread: nemo_curator/classifiers/fineweb_edu.py
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick requested a review from VibhuJawa February 7, 2025 20:10
@VibhuJawa (Collaborator) left a comment

Minor nits around autocast and tokenizer types

Resolved (outdated) review threads: nemo_curator/classifiers/fineweb_edu.py (4)
Signed-off-by: Sarah Yurick <[email protected]>
@VibhuJawa (Collaborator) left a comment

LGTM
