
Add support for Nemotron-CC quality classifiers #518

Open · wants to merge 10 commits into main
Conversation

@sarahyurick (Collaborator) commented Feb 4, 2025

Awaiting Hugging Face releases.

TODO:

  • tests/test_classifiers.py
  • README.md
  • tutorials/distributed_data_classification/README.md
  • Create notebook tutorial(s)
  • docs/user-guide/api/classifiers.rst
  • docs/user-guide/cpuvsgpu.rst
  • docs/user-guide/distributeddataclassification.rst
  • examples/classifiers/README.md
  • Example script(s)
  • nemo_curator/classifiers/__init__.py
  • Integration script(s)
  • nemo_curator/scripts/classifiers/README.md
  • Classifier script(s)
  • pyproject.toml
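
While those items land, here is a minimal usage sketch in the style of the existing NeMo Curator classifiers (e.g. FineWebEduClassifier). The class name FineWebMixtralEduClassifier and the paths are assumptions based on this PR and may shift once the Hugging Face releases are out:

    from nemo_curator.classifiers import FineWebMixtralEduClassifier  # tentative name from this PR
    from nemo_curator.datasets import DocumentDataset

    # Read a JSONL dataset onto the GPU (cudf backend), as in the existing classifier examples
    dataset = DocumentDataset.read_json("input_data/", backend="cudf")

    # Score each document and write the results back out
    classifier = FineWebMixtralEduClassifier()
    scored = classifier(dataset)
    scored.to_json("output_data/")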

sarahyurick and others added 7 commits February 4, 2025 15:22
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick marked this pull request as ready for review February 7, 2025 19:08
@VibhuJawa (Collaborator) left a comment

Did an initial review; mostly looks good.

Have requested changes around naming, class structure, etc.

Resolved (outdated) review threads:

  • docs/user-guide/api/classifiers.rst (2)
  • docs/user-guide/distributeddataclassification.rst (6)
Comment on lines 52 to 54:

    model = AutoModelForSequenceClassification.from_pretrained(
        self.path_or_name, torch_dtype=torch.bfloat16
    )
@VibhuJawa (Collaborator):

Interesting that we now read FINEWEB_MIXTRAL_IDENTIFIER and FINEWEB_NEMOTRON_IDENTIFIER in bfloat16.

Do you know why we can't/don't do it for the EDU classifier?

Can you add a comment stating the reason for this?
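
For illustration, the asymmetry in question is between these two load paths (a sketch with a placeholder model id, not the PR's actual constants):

    import torch
    from transformers import AutoModelForSequenceClassification

    model_id = "some-org/quality-classifier"  # hypothetical stand-in for FINEWEB_*_IDENTIFIER

    # FineWeb Mixtral/Nemotron path in this PR: weights materialized directly in bfloat16
    model_bf16 = AutoModelForSequenceClassification.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    )

    # EDU classifier path: default load, weights kept in fp32
    model_fp32 = AutoModelForSequenceClassification.from_pretrained(model_id)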

@sarahyurick (Collaborator, Author):

Not sure; this is how it was done in the script that the Nemotron-CC developers used.

@VibhuJawa (Collaborator):

Can we do a quick benchmark with autocast? I think this should happen automatically when we use torch.autocast, which we use (or should use).

If the results on a dataset line up (for both accuracy and throughput), we can probably skip this fork.
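
A minimal sketch of such a check, assuming a CUDA device and the same placeholder model id as above (the real identifiers live behind FINEWEB_MIXTRAL_IDENTIFIER / FINEWEB_NEMOTRON_IDENTIFIER in the PR):

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_id = "some-org/quality-classifier"  # hypothetical stand-in
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    batch = tokenizer(["Example document text."], return_tensors="pt", truncation=True).to("cuda")

    # The fork in question: weights loaded directly in bfloat16
    model_bf16 = AutoModelForSequenceClassification.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    ).to("cuda").eval()

    # Alternative: fp32 weights with bfloat16 compute supplied by autocast
    model_fp32 = AutoModelForSequenceClassification.from_pretrained(model_id).to("cuda").eval()

    with torch.no_grad():
        ref = model_bf16(**batch).logits.float()
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            alt = model_fp32(**batch).logits.float()

    # If the scores (and throughput on a real dataset) line up, the explicit bfloat16 load can go
    print((ref - alt).abs().max().item())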

@sarahyurick (Collaborator, Author):

Just ran it: the results look a bit different without bfloat16, but still pretty similar to the ones from before. I have removed torch_dtype=torch.bfloat16 for now.

Resolved (outdated) review thread: nemo_curator/classifiers/fineweb_edu.py
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick requested a review from VibhuJawa February 7, 2025 20:10
@VibhuJawa (Collaborator) left a comment

Minor nits around autocast and tokenizer types

Resolved (outdated) review threads: nemo_curator/classifiers/fineweb_edu.py (4)
Signed-off-by: Sarah Yurick <[email protected]>
@VibhuJawa (Collaborator) left a comment

LGTM
