Commit: fix reference

dylanbouchard committed Dec 3, 2024
1 parent 7f5b023 commit 839f2ae
Showing 1 changed file with 1 addition and 1 deletion.
paper/paper.md: 1 addition & 1 deletion
@@ -72,7 +72,7 @@ Class | Risk Assessed | Applicable Tasks |


### Toxicity Metrics
- The `ToxicityMetrics` class facilitates simple computation of toxicity metrics from a user-provided list of LLM responses. These metrics leverage a pre-trained toxicity classifier that maps a text input to a toxicity score ranging from 0 to 1 [@Gehman2020RealToxicityPromptsEN, @liang2023holisticevaluationlanguagemodels]. For off-the-shelf toxicity classifiers, the `ToxicityMetrics` class provides four options: two classifiers from the `detoxify` package, `roberta-hate-speech-dynabench-r4-target` from the `evaluate` package, and `toxigen` available on HuggingFace.^[https://github.com/unitaryai/detoxify; https://github.com/huggingface/evaluate; https://github.com/microsoft/TOXIGEN] For additional flexibility, users can specify an ensemble of the off-the-shelf classifiers offered or provide a custom toxicity classifier object.
+ The `ToxicityMetrics` class facilitates simple computation of toxicity metrics from a user-provided list of LLM responses. These metrics leverage a pre-trained toxicity classifier that maps a text input to a toxicity score ranging from 0 to 1 [@Gehman2020RealToxicityPromptsEN; @liang2023holisticevaluationlanguagemodels]. For off-the-shelf toxicity classifiers, the `ToxicityMetrics` class provides four options: two classifiers from the `detoxify` package, `roberta-hate-speech-dynabench-r4-target` from the `evaluate` package, and `toxigen` available on HuggingFace.^[https://github.com/unitaryai/detoxify; https://github.com/huggingface/evaluate; https://github.com/microsoft/TOXIGEN] For additional flexibility, users can specify an ensemble of the off-the-shelf classifiers offered or provide a custom toxicity classifier object.
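
For illustration, a minimal usage sketch of the class described above. The import path (`langfair.metrics.toxicity`), the `classifiers` constructor argument, the classifier name, and the `evaluate` signature are assumptions for this sketch, not details taken from the diff:

```python
# Hypothetical sketch: the import path, `classifiers` argument, classifier
# name, and `evaluate` signature are assumptions, not confirmed by this commit.
from langfair.metrics.toxicity import ToxicityMetrics

# A user-provided list of LLM responses (toy examples).
responses = [
    "I disagree, but I appreciate your perspective.",
    "That approach seems reasonable given the constraints.",
]

# Choose one or more off-the-shelf classifiers (an ensemble), or pass a
# custom toxicity classifier object instead.
tm = ToxicityMetrics(classifiers=["detoxify_unbiased"])

# Each response is mapped to a toxicity score in [0, 1] by the classifier,
# and aggregate toxicity metrics are computed over the list.
result = tm.evaluate(responses=responses)
print(result["metrics"])
```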

### Stereotype Metrics
To measure stereotypes in LLM responses, the `StereotypeMetrics` class offers two categories of metrics: metrics based on word cooccurrences and metrics that leverage a pre-trained stereotype classifier. Metrics based on word cooccurrences aim to assess relative cooccurrence of stereotypical words with certain protected attribute words. On the other hand, stereotype-classifier-based metrics leverage the `wu981526092/Sentence-Level-Stereotype-Detector` classifier available on HuggingFace [@zekun2023auditinglargelanguagemodels] and compute analogs of the aforementioned toxicity-classifier-based metrics [@bouchard2024actionableframeworkassessingbias].^[https://huggingface.co/wu981526092/Sentence-Level-Stereotype-Detector]
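
Analogously, a hedged sketch of how `StereotypeMetrics` might be invoked; the import path and method signature are likewise assumptions based on the paragraph above:

```python
# Hypothetical sketch with the same caveats: import path and signature
# are assumptions inferred from the class description, not from this diff.
from langfair.metrics.stereotype import StereotypeMetrics

responses = [
    "The engineer presented her design to the board.",
    "The nurse reviewed his notes before the shift.",
]

# Computes word-cooccurrence metrics as well as metrics derived from the
# sentence-level stereotype classifier referenced above.
sm = StereotypeMetrics()
result = sm.evaluate(responses=responses)
print(result["metrics"])
```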
