Commit

Merge branch 'main' into main

philm001 authored Feb 7, 2025
2 parents ff93c8a + 34a1cc6 commit 64541f5

Showing 42 changed files with 796 additions and 200 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -23,8 +23,8 @@ All of our text pipelines have great multilingual support.
- [Download and Extraction](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/download.html)
- Default implementations for Common Crawl, Wikipedia, and ArXiv sources
- Easily customize and extend to other sources
- - [Language Identification](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentificationunicodeformatting.html)
- - [Unicode Reformatting](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentificationunicodeformatting.html)
+ - [Language Identification](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentification.html)
+ - [Text Cleaning](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/textcleaning.html)
- [Heuristic Filtering](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/qualityfiltering.html)
- Classifier Filtering
- [fastText](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/qualityfiltering.html)
4 changes: 2 additions & 2 deletions docs/user-guide/cpuvsgpu.rst
@@ -69,10 +69,10 @@ The following NeMo Curator modules are GPU based.

* Domain Classification (English and multilingual)
* Quality Classification
- * AEGIS and Instruction-Data-Guard Safety Models
+ * AEGIS and Instruction Data Guard Safety Models
* FineWeb Educational Content Classification
* Content Type Classification
- * Prompt Task/Complexity Classification
+ * Prompt Task and Complexity Classification

GPU modules store the ``DocumentDataset`` using a ``cudf`` backend instead of a ``pandas`` one.
To read a dataset into GPU memory, one could use the following function call.
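
A minimal sketch of such a call (our illustration, since the diff truncates the original example; it assumes ``DocumentDataset.read_json`` accepts a ``backend`` argument, and the file path is hypothetical):

.. code-block:: python

    from nemo_curator.datasets import DocumentDataset

    # backend="cudf" is assumed to load the files directly into GPU memory.
    gpu_dataset = DocumentDataset.read_json("books.jsonl", backend="cudf")
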
28 changes: 14 additions & 14 deletions docs/user-guide/distributeddataclassification.rst
@@ -15,7 +15,7 @@ NeMo Curator provides a module to help users run inference with pre-trained models
This is achieved by chunking the datasets across multiple computing nodes, each equipped with multiple GPUs, to accelerate the classification task in a distributed manner.
Since the classification of a single text document is independent of other documents within the dataset, we can distribute the workload across multiple nodes and GPUs to perform parallel processing.

- Domain (English and multilingual), quality, content safety, educational content, content type, and prompt task/complexity models are tasks we include as examples within our module.
+ Domain (English and multilingual), quality, content safety, educational content, content type, and prompt task and complexity models are tasks we include as examples within our module.

Here, we summarize why each is useful for training an LLM:

@@ -27,13 +27,13 @@

- The **AEGIS Safety Models** are essential for filtering harmful or risky content, which is critical for training models that should avoid learning from unsafe data. By classifying content into 13 critical risk categories, AEGIS helps remove harmful or inappropriate data from the training sets, improving the overall ethical and safety standards of the LLM.

- - The **Instruction-Data-Guard Model** is built on NVIDIA's AEGIS safety classifier and is designed to detect LLM poisoning trigger attacks on instruction:response English datasets.
+ - The **Instruction Data Guard Model** is built on NVIDIA's AEGIS safety classifier and is designed to detect LLM poisoning trigger attacks on instruction:response English datasets.

- The **FineWeb Educational Content Classifier** focuses on identifying and prioritizing educational material within datasets. This classifier is especially useful for training LLMs on specialized educational content, which can improve their performance on knowledge-intensive tasks. Models trained on high-quality educational content demonstrate enhanced capabilities on academic benchmarks such as MMLU and ARC, showcasing the classifier's impact on improving the knowledge-intensive task performance of LLMs.

- The **Content Type Classifier** is designed to categorize documents into one of 11 distinct speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types.

- - The **Prompt Task/Complexity Classifier** is a multi-headed model which classifies English text prompts across task types and complexity dimensions.
+ - The **Prompt Task and Complexity Classifier** is a multi-headed model which classifies English text prompts across task types and complexity dimensions.

-----------------------------------------
Usage
@@ -95,8 +95,8 @@ Using the ``MultilingualDomainClassifier`` is very similar to using the ``DomainClassifier``
For more information about the multilingual domain classifier, including its supported languages, please see the `nvidia/multilingual-domain-classifier <https://huggingface.co/nvidia/multilingual-domain-classifier>`_ on Hugging Face.
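
A short sketch of that pattern (our illustration; the file path and domain labels here are hypothetical):

.. code-block:: python

    from nemo_curator.classifiers import MultilingualDomainClassifier
    from nemo_curator.datasets import DocumentDataset

    input_dataset = DocumentDataset.read_json("multilingual_books.jsonl", backend="cudf")

    # filter_by keeps only documents whose predicted domain is in the list.
    classifier = MultilingualDomainClassifier(filter_by=["Games", "Sports"])
    result_dataset = classifier(dataset=input_dataset)
    result_dataset.to_json("games_and_sports/")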

- Quality Classifier
- ^^^^^^^^^^^^^^^^^^
+ Quality Classifier DeBERTa
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^

The Quality Classifier is designed to assess the quality of text documents, helping to filter out low-quality or noisy data from your dataset.
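
A short sketch of the usual pattern (our illustration; ``filter_by`` assumes the model's "High"/"Medium"/"Low" label set, and the file paths are hypothetical):

.. code-block:: python

    from nemo_curator.classifiers import QualityClassifier
    from nemo_curator.datasets import DocumentDataset

    input_dataset = DocumentDataset.read_json("books.jsonl", backend="cudf")

    # Keep only documents the model labels as high quality.
    quality_classifier = QualityClassifier(filter_by=["High"])
    result_dataset = quality_classifier(dataset=input_dataset)
    result_dataset.to_json("high_quality/")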

@@ -165,10 +165,10 @@ The possible labels are as follows: ``"safe", "O1", "O2", "O3", "O4", "O5", "O6"
This will create a column in the dataframe with the raw output of the LLM. You can choose to parse this response however you want.
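
For instance, a hypothetical post-processing step (the ``raw_pred`` column name below is illustrative; use whichever column you configured for the raw output):

.. code-block:: python

    from nemo_curator.datasets import DocumentDataset

    # Keep only documents whose raw LLM response contains "safe".
    df = result_dataset.df
    safe_dataset = DocumentDataset(df[df["raw_pred"].str.contains("safe")])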

- Instruction-Data-Guard Model
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ Instruction Data Guard
+ ^^^^^^^^^^^^^^^^^^^^^^

- Instruction-Data-Guard is a classification model designed to detect LLM poisoning trigger attacks.
+ Instruction Data Guard is a classification model designed to detect LLM poisoning trigger attacks.
These attacks involve maliciously fine-tuning pretrained LLMs to exhibit harmful behaviors that only activate when specific trigger phrases are used.
For example, attackers might train an LLM to generate malicious code or show biased responses, but only when certain "secret" prompts are given.

@@ -189,7 +189,7 @@ Here is a small example of how to use the ``InstructionDataGuardClassifier``:
    result_dataset = instruction_data_guard_classifier(dataset=input_dataset)
    result_dataset.to_json("labeled_dataset/")

- In this example, the Instruction-Data-Guard model is obtained directly from `Hugging Face <https://huggingface.co/nvidia/instruction-data-guard>`_.
+ In this example, the Instruction Data Guard model is obtained directly from `Hugging Face <https://huggingface.co/nvidia/instruction-data-guard>`_.
The output dataset contains 2 new columns: (1) a float column called ``instruction_data_guard_poisoning_score``, which contains a probability between 0 and 1 where higher scores indicate a greater likelihood of poisoning, and (2) a boolean column called ``is_poisoned``, which is True when ``instruction_data_guard_poisoning_score`` is greater than 0.5 and False otherwise.
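
For example, a minimal way to drop flagged documents using those columns (a sketch based on the column names described above):

.. code-block:: python

    from nemo_curator.datasets import DocumentDataset

    # Keep only documents that Instruction Data Guard did not flag as poisoned.
    df = result_dataset.df
    clean_dataset = DocumentDataset(df[~df["is_poisoned"]])
    clean_dataset.to_json("clean_dataset/")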

FineWeb Educational Content Classifier
@@ -236,8 +236,8 @@ For example, to create a dataset with only highly educational content (scores 4 and above):
    high_edu_dataset = result_dataset[result_dataset["fineweb-edu-score-int"] >= 4]
    high_edu_dataset.to_json("high_educational_content/")

- Content Type Classifier
- ^^^^^^^^^^^^^^^^^^^^^^^
+ Content Type Classifier DeBERTa
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The Content Type Classifier is used to categorize speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types.

@@ -258,10 +258,10 @@ Let's see how ``ContentTypeClassifier`` works in a small excerpt taken from ``ex
In this example, the content type classifier is obtained directly from `Hugging Face <https://huggingface.co/nvidia/content-type-classifier-deberta>`_.
It filters the input dataset to include only documents classified as "Blogs" or "News".
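
A sketch of that call (our illustration, since the diff truncates the original excerpt; it assumes the same constructor pattern as the other classifiers in this module, and the file paths are hypothetical):

.. code-block:: python

    from nemo_curator.classifiers import ContentTypeClassifier
    from nemo_curator.datasets import DocumentDataset

    input_dataset = DocumentDataset.read_json("books.jsonl", backend="cudf")

    # Keep only documents classified as one of these two speech types.
    content_type_classifier = ContentTypeClassifier(filter_by=["Blogs", "News"])
    result_dataset = content_type_classifier(dataset=input_dataset)
    result_dataset.to_json("blogs_and_news/")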

- Prompt Task/Complexity Classifier
- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ Prompt Task and Complexity Classifier
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- The Prompt Task/Complexity Classifier is a multi-headed model which classifies English text prompts across task types and complexity dimensions. Tasks are classified across 11 common categories. Complexity is evaluated across 6 dimensions and ensembled to create an overall complexity score.
+ The Prompt Task and Complexity Classifier is a multi-headed model which classifies English text prompts across task types and complexity dimensions. Tasks are classified across 11 common categories. Complexity is evaluated across 6 dimensions and ensembled to create an overall complexity score.

Here's an example of how to use the ``PromptTaskComplexityClassifier``:
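
A minimal sketch of the pattern (our illustration, since the diff truncates the original snippet; the file paths are hypothetical):

.. code-block:: python

    from nemo_curator.classifiers import PromptTaskComplexityClassifier
    from nemo_curator.datasets import DocumentDataset

    input_dataset = DocumentDataset.read_json("prompts.jsonl", backend="cudf")

    # Annotates each prompt with predicted task type and complexity scores.
    classifier = PromptTaskComplexityClassifier()
    result_dataset = classifier(dataset=input_dataset)
    result_dataset.to_json("labeled_prompts/")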

7 changes: 5 additions & 2 deletions docs/user-guide/index.rst
@@ -16,8 +16,8 @@ Text Curation
:ref:`Document Filtering <data-curator-qualityfiltering>`
This section describes how to use the 30+ heuristic and classifier filters available within the NeMo Curator and implement custom filters to apply to the documents within the corpora.

- :ref:`Language Identification and Unicode Fixing <data-curator-languageidentification>`
- Large, unlabeled text corpora often contain a variety of languages. The NeMo Curator provides utilities to identify languages and fix improperly decoded Unicode characters.
+ :ref:`Language Identification <data-curator-languageidentification>`
+ Large, unlabeled text corpora often contain a variety of languages. NeMo Curator provides utilities to identify languages.

+ :ref:`Text Cleaning <data-curator-text-cleaning>`
+ Many parts of the Internet contain malformed or poorly formatted text. NeMo Curator can fix many of these issues.

:ref:`GPU Accelerated Exact and Fuzzy Deduplication <data-curator-gpu-deduplication>`
Both exact and fuzzy deduplication functionalities are supported in NeMo Curator and accelerated using RAPIDS cuDF.
docs/user-guide/languageidentificationunicodeformatting.rst → docs/user-guide/languageidentification.rst
@@ -11,40 +11,17 @@ Background
Large unlabeled text corpora often contain a variety of languages.
However, data curation usually includes steps that are language specific (e.g. using language-tuned heuristics for quality filtering)
and many curators are only interested in curating a monolingual dataset.
- Datasets also may have improperly decoded unicode characters (e.g. "The Mona Lisa doesn't have eyebrows." decoding as "The Mona Lisa doesn’t have eyebrows.").

- NeMo Curator provides utilities to identify languages and fix improperly decoded unicode characters.
- The language identification is performed using `fastText <https://fasttext.cc/docs/en/language-identification.html>`_ and unicode fixing is performed using `ftfy <https://ftfy.readthedocs.io/en/latest/>`_.
+ NeMo Curator provides utilities to identify languages using `fastText <https://fasttext.cc/docs/en/language-identification.html>`_.
Even though a preliminary language identification may have been performed on the unextracted text (as is the case in our Common Crawl pipeline
using pyCLD2), `fastText <https://fasttext.cc/docs/en/language-identification.html>`_ is more accurate so it can be used for a second pass.

-----------------------------------------
Usage
-----------------------------------------

- We provide an example of how to use the language identification and unicode reformatting utility at ``examples/identify_languages_and_fix_unicode.py``.
+ We provide an example of how to use the language identification utility at ``examples/identify_languages.py``.
At a high level, the module first identifies the languages of the documents and removes any documents for which it has high uncertainty about the language.
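
A sketch of that step (our illustration; it assumes NeMo Curator's ``ScoreFilter`` with the ``FastTextLangId`` filter, and a separately downloaded fastText LID model file):

.. code-block:: python

    import nemo_curator as nc
    from nemo_curator.filters import FastTextLangId

    # lid.176.bin is fastText's standard language-identification model,
    # downloaded separately from fasttext.cc.
    language_id_step = nc.ScoreFilter(
        FastTextLangId(model_path="lid.176.bin"),
        score_field="language",
        score_type="object",
    )
    lang_data = language_id_step(dataset)
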
- Notably, this line uses one of the ``DocumentModifiers`` that NeMo Curator provides:
-
- .. code-block:: python
-
-     cleaner = nc.Modify(UnicodeReformatter())
-     cleaned_data = cleaner(lang_data)
-
- ``DocumentModifier``s like ``UnicodeReformatter`` are very similar to ``DocumentFilter``s.
- They implement a single ``modify_document`` function that takes in a document and outputs a modified document.
- Here is the implementation of the ``UnicodeReformatter`` modifier:
-
- .. code-block:: python
-
-     class UnicodeReformatter(DocumentModifier):
-         def __init__(self):
-             super().__init__()
-
-         def modify_document(self, text: str) -> str:
-             return ftfy.fix_text(text)
-
- Also like the ``DocumentFilter`` functions, ``modify_document`` can be annotated with ``batched`` to take in a pandas series of documents instead of a single document.

-----------------------------------------
Related Scripts
@@ -79,15 +56,4 @@ within that file. Below is an example run command for :code:`separate_by_metadata`:
    --output-metadata-distribution=./data/lang_distro.json

After running this module, the output directory will consist of one directory per language present within the corpus and all documents
- within those directories will contain text that originates from the same language. Finally, the text within a specific language can have
- its unicode fixed using the :code:`text_cleaning` module

- .. code-block:: bash
-
-     text_cleaning \
-       --input-data-dir=<Output directory containing sub-directories>/EN \
-       --output-clean-dir=<Output directory to which cleaned english documents will be written>
-
- The above :code:`text_cleaning` module uses the heuristics defined within the :code:`ftfy` package that is commonly used for fixing
- improperly decoded unicode.
+ within those directories will contain text that originates from the same language.
98 changes: 98 additions & 0 deletions docs/user-guide/text-cleaning.rst
@@ -0,0 +1,98 @@
.. _data-curator-text-cleaning:

=========================
Text Cleaning
=========================

--------------------
Overview
--------------------
Use NeMo Curator's text cleaning modules to remove undesirable text such as improperly decoded Unicode characters, inconsistent line spacing, or excessive URLs from documents being pre-processed for your dataset.

For example, the input sentence ``"The Mona Lisa doesn't have eyebrows."`` from a given document may not have had its apostrophe (``'``) encoded properly, resulting in the sentence decoding as ``"The Mona Lisa doesn’t have eyebrows."`` NeMo Curator enables you to easily run this document through the default ``UnicodeReformatter()`` module to detect and fix the improperly decoded characters, or you can define your own custom Unicode text cleaner tailored to your needs.
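
To see the underlying fix in isolation, you can run `ftfy <https://ftfy.readthedocs.io/en/latest/>`_ (the library the default reformatter relies on) directly:

.. code-block:: python

    import ftfy

    # ftfy repairs mojibake introduced by a bad encode/decode round trip.
    print(ftfy.fix_text("The Mona Lisa doesn’t have eyebrows."))
    # Output: The Mona Lisa doesn't have eyebrows.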

--------------------
Use Cases
--------------------
* Fix improperly decoded Unicode characters from webpages.
* Standardize document layout by removing excessive newlines.
* Remove URLs in documents.

--------------------
Modules
--------------------
NeMo Curator provides the following modules for cleaning text:

- ``UnicodeReformatter()``: Uses `ftfy <https://ftfy.readthedocs.io/en/latest/>`_ to fix broken Unicode characters. Modifies the "text" field of the dataset by default.
- ``NewlineNormalizer()``: Uses regex to replace 3 or more consecutive newline characters in each document with only 2 newline characters.
- ``UrlRemover()``: Uses regex to remove all URLs in each document.

You can use these modules individually or sequentially in a cleaning pipeline.

Consider the following example, which loads a dataset (``books.jsonl``), steps through each module in a cleaning pipeline, and outputs the processed dataset as ``cleaned_books.jsonl``:


.. code-block:: python

    from nemo_curator import Sequential, Modify, get_client
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.modifiers import UnicodeReformatter, UrlRemover, NewlineNormalizer

    def main():
        client = get_client(cluster_type="cpu")

        dataset = DocumentDataset.read_json("books.jsonl")
        cleaning_pipeline = Sequential([
            Modify(UnicodeReformatter()),
            Modify(NewlineNormalizer()),
            Modify(UrlRemover()),
        ])

        cleaned_dataset = cleaning_pipeline(dataset)
        cleaned_dataset.to_json("cleaned_books.jsonl")

    if __name__ == "__main__":
        main()
You can also perform text cleaning operations using the CLI by running the ``text_cleaning`` command:

.. code-block:: bash

    text_cleaning \
      --input-data-dir=/path/to/input/ \
      --output-clean-dir=/path/to/output/ \
      --normalize-newlines \
      --remove-urls
By default, the CLI performs only Unicode reformatting. Adding the ``--normalize-newlines`` and ``--remove-urls`` flags enables the other text cleaning operations.

------------------------
Custom Text Cleaner
------------------------
It's easy to write your own custom text cleaner. The implementation of ``UnicodeReformatter`` can be used as an example.

.. code-block:: python

    import ftfy

    from nemo_curator.modifiers import DocumentModifier

    class UnicodeReformatter(DocumentModifier):
        def __init__(self):
            super().__init__()

        def modify_document(self, text: str) -> str:
            return ftfy.fix_text(text)
Simply define a new class that inherits from ``DocumentModifier`` and implement the constructor and ``modify_document`` method.
Also, like the ``DocumentFilter`` class, ``modify_document`` can be annotated with ``batched`` to take in a pandas series of documents instead of a single document.
See the :ref:`document filtering page <data-curator-qualityfiltering>` for more information.
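
As a further sketch, a hypothetical ``QuoteNormalizer`` that straightens curly quotes follows the same pattern:

.. code-block:: python

    import re

    from nemo_curator import Modify
    from nemo_curator.modifiers import DocumentModifier

    class QuoteNormalizer(DocumentModifier):
        """Hypothetical modifier: replace curly quotes with straight ASCII quotes."""

        def modify_document(self, text: str) -> str:
            text = re.sub(r"[“”]", '"', text)
            return re.sub(r"[‘’]", "'", text)

    # Use it like any other modifier:
    quote_cleaner = Modify(QuoteNormalizer())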

---------------------------
Additional Resources
---------------------------
* `Single GPU Tutorial <https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb>`_
* `ftfy <https://ftfy.readthedocs.io/en/latest/>`_
* `Refined Web Paper <https://arxiv.org/abs/2306.01116>`_
* `Nemotron-CC Paper <https://arxiv.org/abs/2412.02595>`_
10 changes: 7 additions & 3 deletions docs/user-guide/text-curation.rst
@@ -13,8 +13,8 @@ Text Curation
:ref:`Document Filtering <data-curator-qualityfiltering>`
This section describes how to use the 30+ heuristic and classifier filters available within the NeMo Curator and implement custom filters to apply to the documents within the corpora.

- :ref:`Language Identification and Unicode Fixing <data-curator-languageidentification>`
- Large, unlabeled text corpora often contain a variety of languages. The NeMo Curator provides utilities to identify languages and fix improperly decoded Unicode characters.
+ :ref:`Language Identification <data-curator-languageidentification>`
+ Large, unlabeled text corpora often contain a variety of languages. NeMo Curator provides utilities to identify languages.

+ :ref:`Text Cleaning <data-curator-text-cleaning>`
+ Many parts of the Internet contain malformed or poorly formatted text. NeMo Curator can fix many of these issues.

:ref:`GPU Accelerated Exact and Fuzzy Deduplication <data-curator-gpu-deduplication>`
Both exact and fuzzy deduplication functionalities are supported in NeMo Curator and accelerated using RAPIDS cuDF.
@@ -43,7 +46,8 @@ Text Curation
documentdataset.rst
cpuvsgpu.rst
qualityfiltering.rst
- languageidentificationunicodeformatting.rst
+ languageidentification.rst
+ textcleaning.rst
gpudeduplication.rst
semdedup.rst
syntheticdata.rst