
Commit

finalize references
dylanbouchard committed Dec 3, 2024
1 parent 624f975 commit 6d8e45b
Showing 2 changed files with 8 additions and 1 deletion.
7 changes: 7 additions & 0 deletions paper/paper.bib
@@ -554,3 +554,10 @@ @inproceedings{delobelle-etal-2022-measuring
pages = "1693--1706",
abstract = "An increasing awareness of biased patterns in natural language processing resources such as BERT has motivated many metrics to quantify {`}bias{'} and {`}fairness{'} in these resources. However, comparing the results of different metrics and the works that evaluate with such metrics remains difficult, if not outright impossible. We survey the literature on fairness metrics for pre-trained language models and experimentally evaluate compatibility, including both biases in language models and in their downstream tasks. We do this by combining traditional literature survey, correlation analysis and empirical evaluations. We find that many metrics are not compatible with each other and highly depend on (i) templates, (ii) attribute and target seeds and (iii) the choice of embeddings. We also see no tangible evidence of intrinsic bias relating to extrinsic bias. These results indicate that fairness or bias evaluation remains challenging for contextualized language models, among other reasons because these choices remain subjective. To improve future comparisons and fairness evaluations, we recommend to avoid embedding-based metrics and focus on fairness evaluations in downstream tasks.",
}

% Evaluate
@misc{huggingface-no-date,
  author = {{Hugging Face}},
  title = {Evaluate: A library for easily evaluating machine learning models and datasets},
  url = {https://github.com/huggingface/evaluate},
}
2 changes: 1 addition & 1 deletion paper/paper.md
@@ -37,7 +37,7 @@ Traditional machine learning (ML) fairness toolkits like AIF360 [@aif360-oct-201

LLMs are used in systems that solve tasks such as recommendation, classification, text generation, and summarization. In practice, these systems try to restrict the responses of the LLM to the task at hand, often by including task-specific instructions in system or user prompts. When the LLM is evaluated without taking the set of task-specific prompts into account, the evaluation metrics are not representative of the system's true performance. Representing the system's actual performance is especially important when evaluating its outputs for bias and fairness risks, because these risks pose real harm to the user and, through the resulting repercussions, to the system developer.

- Most evaluation tools, including those that assess bias and fairness risk, evaluate LLMs at the model-level by calculating metrics based on the responses of the LLMs to static benchmark datasets of prompts [@rudinger-EtAl:2018:N18; @zhao-2018; @webster-etal-2018-mind; @levy2021collecting; @nadeem2020stereoset; @bartl2020unmasking; @nangia2020crows; @felkner2024winoqueercommunityintheloopbenchmarkantilgbtq; @barikeri2021redditbiasrealworldresourcebias; @kiritchenko2018examininggenderracebias; @qian2022perturbationaugmentationfairernlp; @Gehman2020RealToxicityPromptsEN; @bold_2021; @huang2023trustgptbenchmarktrustworthyresponsible; @nozza-etal-2021-honest; @parrish-etal-2022-bbq; @li-etal-2020-unqovering; @10.1145/3576840.3578295] that do not consider prompt-specific risks and are often independent of the task at hand. Holistic Evaluation of Language Models (HELM) [@liang2023holisticevaluationlanguagemodels], DecodingTrust [@wang2023decodingtrust], and several other toolkits [@srivastava2023beyond; @huang2024trustllm; @eval-harness; Arshaan_Nazir_and_Thadaka_Kalyan_Chakravarthy_and_David_Amore_Cecchini_and_Thadaka_Kalyan_Chakravarthy_and_Rakshit_Khajuria_and_Prikshit_Sharma_and_Ali_Tarik_Mirik_and_Veysel_Kocaman_and_David_Talby_LangTest_A_comprehensive_2024] follow this paradigm.
+ Most evaluation tools, including those that assess bias and fairness risk, evaluate LLMs at the model-level by calculating metrics based on the responses of the LLMs to static benchmark datasets of prompts [@rudinger-EtAl:2018:N18; @zhao-2018; @webster-etal-2018-mind; @levy2021collecting; @nadeem2020stereoset; @bartl2020unmasking; @nangia2020crows; @felkner2024winoqueercommunityintheloopbenchmarkantilgbtq; @barikeri2021redditbiasrealworldresourcebias; @kiritchenko2018examininggenderracebias; @qian2022perturbationaugmentationfairernlp; @Gehman2020RealToxicityPromptsEN; @bold_2021; @huang2023trustgptbenchmarktrustworthyresponsible; @nozza-etal-2021-honest; @parrish-etal-2022-bbq; @li-etal-2020-unqovering; @10.1145/3576840.3578295] that do not consider prompt-specific risks and are often independent of the task at hand. Holistic Evaluation of Language Models (HELM) [@liang2023holisticevaluationlanguagemodels], DecodingTrust [@wang2023decodingtrust], and several other toolkits [@srivastava2023beyond; @huang2024trustllm; @eval-harness; @Arshaan_Nazir_and_Thadaka_Kalyan_Chakravarthy_and_David_Amore_Cecchini_and_Thadaka_Kalyan_Chakravarthy_and_Rakshit_Khajuria_and_Prikshit_Sharma_and_Ali_Tarik_Mirik_and_Veysel_Kocaman_and_David_Talby_LangTest_A_comprehensive_2024; @huggingface-no-date] follow this paradigm.

LangFair complements the aforementioned frameworks because it follows a bring your own prompts (BYOP) approach, which allows users to tailor the bias and fairness evaluation to their use case by computing metrics using LLM responses to user-provided prompts. This addresses the need for a task-based bias and fairness evaluation tool that accounts for prompt-specific risk for LLMs.^[Experiments in [@wang2023decodingtrust] demonstrate that prompt content has substantial influence on the likelihood of biased LLM responses.]
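For illustration, a minimal sketch of the BYOP pattern in plain Python (not LangFair's actual API): metrics are computed over the LLM's responses to the user's own task prompts rather than a fixed benchmark. The `generate_response` and `toxicity_score` callables are hypothetical placeholders for the deployed system's LLM call and for any off-the-shelf toxicity classifier.

```python
from statistics import mean
from typing import Callable, Dict, List


def byop_generation_eval(
    prompts: List[str],
    generate_response: Callable[[str], str],  # hypothetical wrapper around the deployed LLM system
    toxicity_score: Callable[[str], float],   # hypothetical classifier returning a score in [0, 1]
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Compute simple generation-level metrics over responses to user-provided prompts."""
    responses = [generate_response(p) for p in prompts]
    scores = [toxicity_score(r) for r in responses]
    if not scores:
        return {"max_toxicity": 0.0, "mean_toxicity": 0.0, "toxic_fraction": 0.0}
    return {
        "max_toxicity": max(scores),
        "mean_toxicity": mean(scores),
        "toxic_fraction": sum(s >= threshold for s in scores) / len(scores),
    }


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; in practice these would be the
    # system's actual LLM call (with its task-specific prompt template) and a real
    # toxicity classifier.
    demo_prompts = ["Summarize the following claim note ...", "Draft a reply to this customer ..."]
    print(byop_generation_eval(
        demo_prompts,
        generate_response=lambda p: f"(model response to: {p})",
        toxicity_score=lambda r: 0.0,
    ))
```

The same structure extends to other generation-level bias and fairness metrics; only the prompts and the scoring function change with the use case.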

