Add Nemotron CC SDG Pipelines and Pre-processing/Post-Processing Stages #527

Merged on Feb 12, 2025 (23 commits)
8 changes: 8 additions & 0 deletions docs/user-guide/api/filters.rst
@@ -152,6 +152,14 @@ Heuristic Filters
    :members:
    :member-order: bysource

.. autoclass:: nemo_curator.filters.TokenCountFilter
    :members:
    :member-order: bysource

.. autoclass:: nemo_curator.filters.SubstringFilter
    :members:
    :member-order: bysource

------------------------------
Code Filters
------------------------------
6 changes: 6 additions & 0 deletions docs/user-guide/api/misc.rst
@@ -15,3 +15,9 @@ Miscellaneous

.. autoclass:: nemo_curator.Shuffle
    :members:

.. autoclass:: nemo_curator.DocumentSplitter
    :members:

.. autoclass:: nemo_curator.DocumentJoiner
    :members:
19 changes: 19 additions & 0 deletions docs/user-guide/api/modifiers.rst
@@ -32,3 +32,22 @@ Modifiers

.. autoclass:: nemo_curator.modifiers.PiiModifier
    :members:

.. autoclass:: nemo_curator.modifiers.LineRemover
    :members:

.. autoclass:: nemo_curator.modifiers.MarkdownRemover
    :members:

.. autoclass:: nemo_curator.modifiers.NewlineNormalizer
    :members:

.. autoclass:: nemo_curator.modifiers.UrlRemover
    :members:

.. autoclass:: nemo_curator.modifiers.Slicer
    :members:

.. autoclass:: nemo_curator.modifiers.QuotationRemover
    :members:

12 changes: 12 additions & 0 deletions docs/user-guide/api/synthetic.rst
@@ -8,6 +8,18 @@ Synthetic Data
.. autoclass:: nemo_curator.synthetic.AsyncNemotronGenerator
    :members:

.. autoclass:: nemo_curator.synthetic.NemotronCCGenerator
    :members:

.. autoclass:: nemo_curator.synthetic.NemotronCCDiverseQAPostprocessor
    :members:

.. autoclass:: nemo_curator.synthetic.NemotronCCKnowledgeListPostprocessor
    :members:

.. autoclass:: nemo_curator.synthetic.AsyncNemotronCCGenerator
    :members:

.. autoclass:: nemo_curator.synthetic.NemotronFormatter
    :members:

265 changes: 265 additions & 0 deletions docs/user-guide/syntheticdata.rst
@@ -15,6 +15,7 @@ Furthermore, NeMo Curator can also interface with `NeMo's Export and Deploy <htt
module which allows you to host your own model for LLM inference.

NeMo Curator offers prebuilt synthetic data generation pipelines for Supervised Fine-Tuning (SFT) and preference data, which were used to generate data for training `Nemotron-4 340B <https://research.nvidia.com/publication/2024-06_nemotron-4-340b>`_.
It also now supports the pipelines used in generating `Nemotron-CC <https://arxiv.org/abs/2412.02595>`_.
Additionally, you can seamlessly integrate filtering and deduplication steps in your synthetic data pipeline with the other modules available in NeMo Curator.

Connect to an LLM Service
@@ -690,6 +691,270 @@ All of the code so far has been sending requests to the LLM service synchronousl
As you can see, the asynchronous modules have the same interface as the synchronous modules.
The only exception is that a ``max_concurrent_requests`` parameter can be supplied to the constructor of ``AsyncNemotronGenerator`` as a form of rate limiting if your service is rate limited.
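
To illustrate the general idea behind a concurrency cap such as ``max_concurrent_requests``, here is a minimal sketch built on ``asyncio.Semaphore``. It is not NeMo Curator's implementation: ``fake_llm_request`` and ``run_with_limit`` are hypothetical names, and the short sleep stands in for a real network call.

```python
import asyncio


async def fake_llm_request(prompt: str, tracker: dict) -> str:
    # Hypothetical stand-in for a request to an LLM service.
    tracker["in_flight"] += 1
    tracker["peak"] = max(tracker["peak"], tracker["in_flight"])
    await asyncio.sleep(0.01)  # simulate network latency
    tracker["in_flight"] -= 1
    return f"response to: {prompt}"


async def run_with_limit(prompts, max_concurrent_requests: int, tracker: dict):
    # The semaphore allows at most max_concurrent_requests requests in flight.
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def limited(prompt: str) -> str:
        async with semaphore:
            return await fake_llm_request(prompt, tracker)

    # gather returns results in the same order as the prompts.
    return await asyncio.gather(*(limited(p) for p in prompts))


tracker = {"in_flight": 0, "peak": 0}
prompts = [f"prompt {i}" for i in range(10)]
responses = asyncio.run(
    run_with_limit(prompts, max_concurrent_requests=3, tracker=tracker)
)
print(len(responses), "responses; peak concurrency:", tracker["peak"])
```

All ten prompts are scheduled at once, but the recorded peak never exceeds the configured limit, which is the behavior you want when a service enforces a rate limit.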

Customize the Nemotron-CC Pipeline
-----------------------------------

Nemotron-CC used a collection of pipelines that rephrase reference documents into different formats and styles.
NeMo Curator provides a synchronous and an asynchronous version of each pipeline through ``nemo_curator.synthetic.NemotronCCGenerator`` and ``nemo_curator.synthetic.AsyncNemotronCCGenerator``, respectively.

Rewrite to Wikipedia Style
##########################

The ``NemotronCCGenerator.rewrite_to_wikipedia_style`` method rewrites a document into a style that is similar to Wikipedia.

.. code-block:: python

    from openai import OpenAI
    from nemo_curator import OpenAIClient
    from nemo_curator.synthetic import NemotronCCGenerator

    openai_client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key="<insert NVIDIA API key>"
    )
    client = OpenAIClient(openai_client)
    generator = NemotronCCGenerator(client)

    document = "The moon is bright. It shines at night."
    model = "nv-mistralai/mistral-nemo-12b-instruct"
    model_kwargs = {
        "temperature": 0.5,
        "top_p": 0.9,
        "max_tokens": 512,
    }

    responses = generator.rewrite_to_wikipedia_style(
        document=document, model=model, model_kwargs=model_kwargs
    )

    print(responses[0])
    # Output:
    # The lunar surface has a high albedo, which means it reflects a significant amount of sunlight.


Generate Diverse QA Pairs
#########################

The ``NemotronCCGenerator.generate_diverse_qa`` method generates a list of diverse QA pairs from a document.

.. code-block:: python

    from openai import OpenAI
    from nemo_curator import OpenAIClient
    from nemo_curator.synthetic import NemotronCCGenerator

    openai_client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key="<insert NVIDIA API key>"
    )
    client = OpenAIClient(openai_client)
    generator = NemotronCCGenerator(client)

    document = "The moon is bright. It shines at night."
    model = "nv-mistralai/mistral-nemo-12b-instruct"
    model_kwargs = {
        "temperature": 0.5,
        "top_p": 0.9,
        "max_tokens": 600,
    }

    responses = generator.generate_diverse_qa(
        document=document, model=model, model_kwargs=model_kwargs
    )

    print(responses[0])
    # Output:
    # Question: What is the moon made of?
    # Answer: The moon is made of rock and dust.


To help with cleaning the output, the ``NemotronCCDiverseQAPostprocessor`` class is provided.

.. code-block:: python

    import pandas as pd
    from openai import OpenAI
    from nemo_curator import OpenAIClient
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.synthetic import NemotronCCGenerator, NemotronCCDiverseQAPostprocessor

    openai_client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key="<insert NVIDIA API key>"
    )
    client = OpenAIClient(openai_client)
    generator = NemotronCCGenerator(client)

    document = "The moon is bright. It shines at night."
    model = "nv-mistralai/mistral-nemo-12b-instruct"
    model_kwargs = {
        "temperature": 0.5,
        "top_p": 0.9,
        "max_tokens": 600,
    }
    responses = generator.generate_diverse_qa(document=document, model=model, model_kwargs=model_kwargs)
    postprocessor = NemotronCCDiverseQAPostprocessor(text_field="text", response_field="diverse_qa_response")
    dataset = DocumentDataset.from_pandas(pd.DataFrame({"text": document, "diverse_qa_response": responses}))
    cleaned_dataset = postprocessor(dataset)

    first_entry = cleaned_dataset.df.head(1)
    # Select the single value so the text prints without the pandas index.
    print(first_entry["diverse_qa_response"].iloc[0])
    # Output:
    # The moon is bright. It shines at night. Question: What is the moon made of? Answer: The moon is made of rock and dust.
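
As a rough illustration of the kind of cleanup such a postprocessor performs, the sketch below pulls ``Question:``/``Answer:`` pairs out of a raw response and appends them to the original document. It is a simplified, hypothetical stand-in: ``parse_qa_pairs`` and ``append_qa_to_document`` are not NeMo Curator functions, and the real class may apply additional steps.

```python
import re

# Each pair starts at "Question:" and ends before the next "Question:" or at
# the end of the string. DOTALL lets answers span multiple lines.
QA_PATTERN = re.compile(
    r"Question:\s*(?P<question>.+?)\s*Answer:\s*(?P<answer>.+?)(?=\s*Question:|\Z)",
    flags=re.DOTALL,
)


def parse_qa_pairs(response: str) -> list:
    """Return (question, answer) tuples found in a raw LLM response."""
    return [
        (match["question"].strip(), match["answer"].strip())
        for match in QA_PATTERN.finditer(response)
    ]


def append_qa_to_document(document: str, response: str) -> str:
    """Append the parsed QA pairs to the document text (assumed output format)."""
    pairs = parse_qa_pairs(response)
    if not pairs:
        # Nothing usable in the response; keep the document unchanged.
        return document
    qa_text = " ".join(f"Question: {q} Answer: {a}" for q, a in pairs)
    return f"{document} {qa_text}"


raw_response = (
    "Here are the questions and answers:\n"
    "Question: What is the moon made of?\n"
    "Answer: The moon is made of rock and dust.\n"
)
document = "The moon is bright. It shines at night."
print(append_qa_to_document(document, raw_response))
```

Text before the first ``Question:`` marker, such as a preamble added by the model, is ignored by the pattern.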


Generate Knowledge List
#######################

The ``NemotronCCGenerator.generate_knowledge_list`` method generates a bulleted list of knowledge statements from a document.

.. code-block:: python

    from openai import OpenAI
    from nemo_curator import OpenAIClient
    from nemo_curator.synthetic import NemotronCCGenerator

    openai_client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key="<insert NVIDIA API key>"
    )
    client = OpenAIClient(openai_client)
    generator = NemotronCCGenerator(client)

    document = "The moon is bright. It shines at night."
    model = "nv-mistralai/mistral-nemo-12b-instruct"
    model_kwargs = {
        "temperature": 0.5,
        "top_p": 0.9,
        "max_tokens": 600,
    }

    responses = generator.generate_knowledge_list(
        document=document, model=model, model_kwargs=model_kwargs
    )

    print(responses[0])
    # Output:
    # - The moon is made of rock and dust.
    # - The moon is the only natural satellite of the Earth.
    # ...

To help with cleaning the output, the ``NemotronCCKnowledgeListPostprocessor`` class is provided.

.. code-block:: python

    import pandas as pd
    from openai import OpenAI

    from nemo_curator import OpenAIClient
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.synthetic import NemotronCCGenerator, NemotronCCKnowledgeListPostprocessor

    openai_client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key="<insert NVIDIA API key>"
    )
    client = OpenAIClient(openai_client)
    generator = NemotronCCGenerator(client)

    document = "The moon is bright. It shines at night."
    model = "nv-mistralai/mistral-nemo-12b-instruct"
    model_kwargs = {
        "temperature": 0.5,
        "top_p": 0.9,
        "max_tokens": 600,
    }

    responses = generator.generate_knowledge_list(
        document=document, model=model, model_kwargs=model_kwargs
    )

    print(responses[0])
    # Output:
    # - The moon is made of rock and dust.
    # - The moon is the only natural satellite of the Earth.
    # ...

    postprocessor = NemotronCCKnowledgeListPostprocessor(text_field="knowledge_list_response")
    dataset = DocumentDataset.from_pandas(pd.DataFrame({"knowledge_list_response": responses}))
    cleaned_dataset = postprocessor(dataset)

    first_entry = cleaned_dataset.df.head(1)
    # Select the single value so the text prints without the pandas index.
    print(first_entry["knowledge_list_response"].iloc[0])
    # Output:
    # The moon is made of rock and dust.
    # The moon is the only natural satellite of the Earth.
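
The cleanup shown in the output above can be approximated with a few lines of plain Python. This is only a sketch under the assumption that cleaning means dropping blank lines and leading ``- `` bullet markers; ``clean_knowledge_list`` is a hypothetical helper, not the logic of ``NemotronCCKnowledgeListPostprocessor`` itself.

```python
def clean_knowledge_list(response: str) -> str:
    """Strip bullet markers and blank lines from a raw knowledge list."""
    cleaned_lines = []
    for line in response.splitlines():
        stripped = line.strip()
        if not stripped:
            # Skip empty lines between bullets.
            continue
        if stripped.startswith("- "):
            # Drop the leading bullet marker.
            stripped = stripped[2:].strip()
        cleaned_lines.append(stripped)
    return "\n".join(cleaned_lines)


raw_response = (
    "- The moon is made of rock and dust.\n"
    "\n"
    "- The moon is the only natural satellite of the Earth.\n"
)
print(clean_knowledge_list(raw_response))
```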

Distill Document
#################

The ``NemotronCCGenerator.distill`` method distills a document into a more concise form.

.. code-block:: python

    from openai import OpenAI
    from nemo_curator import OpenAIClient
    from nemo_curator.synthetic import NemotronCCGenerator

    openai_client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key="<insert NVIDIA API key>"
    )
    client = OpenAIClient(openai_client)
    generator = NemotronCCGenerator(client)

    document = "The moon is bright. It shines at night."
    model = "nv-mistralai/mistral-nemo-12b-instruct"
    model_kwargs = {
        "temperature": 0.5,
        "top_p": 0.9,
        "max_tokens": 1600,
    }

    responses = generator.distill(
        document=document, model=model, model_kwargs=model_kwargs
    )

    print(responses[0])
    # Output:
    # The moon is bright at night.


Extract Knowledge
#################

The ``NemotronCCGenerator.extract_knowledge`` method extracts knowledge from a document.

.. code-block:: python

    from openai import OpenAI
    from nemo_curator import OpenAIClient
    from nemo_curator.synthetic import NemotronCCGenerator

    openai_client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key="<insert NVIDIA API key>"
    )
    client = OpenAIClient(openai_client)
    generator = NemotronCCGenerator(client)

    document = "The moon is bright. It shines at night."
    model = "nv-mistralai/mistral-nemo-12b-instruct"
    model_kwargs = {
        "temperature": 0.5,
        "top_p": 0.9,
        "max_tokens": 1400,
    }

    responses = generator.extract_knowledge(
        document=document, model=model, model_kwargs=model_kwargs
    )

    print(responses[0])
    # Output:
    # The moon is a reflective body visible from the Earth at night.


Combine Synthetic Data Generation with other NeMo Curator Modules
-----------------------------------------------------------------
Synthetic data generation, unlike the rest of NeMo Curator, operates independently of Dask.
4 changes: 4 additions & 0 deletions nemo_curator/filters/__init__.py
@@ -49,7 +49,9 @@
    RepeatedParagraphsFilter,
    RepeatingDuplicateNGramsFilter,
    RepeatingTopNGramsFilter,
    SubstringFilter,
    SymbolsToWordsFilter,
    TokenCountFilter,
    UrlsFilter,
    WhiteSpaceFilter,
    WordCountFilter,
@@ -98,4 +100,6 @@
    "QualityEstimationFilter",
    "AnswerabilityFilter",
    "EasinessFilter",
    "TokenCountFilter",
    "SubstringFilter",
]