Withdraw
Loading…
Benchmarking stereotype bias and toxicity in large language models
Dutta, Ritik
Loading…
Permalink
https://hdl.handle.net/2142/124243
Description
- Title
- Benchmarking stereotype bias and toxicity in large language models
- Author(s)
- Dutta, Ritik
- Issue Date
- 2024-04-10
- Director of Research (if dissertation) or Advisor (if thesis)
- Li, Bo
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- AI
- Benchmarks
- Large Language Models
- LLMs
- Abstract
- Large language models (LLMs) powered by the transformer architecture have displayed tremendous advancements in the field of natural language processing (NLP). LLMs pre-trained on large volumes of internet data have been shown to demonstrate near state of the art capabilities on many downstream NLP tasks, which can be further augmented with additional finetuning on task-specific data. Many practioners have been deploying LLMs in customer facing chatbots as well as utilizing them in sensitive application areas such as healthcare (determining insurance policy parameters, assessing claims, etc.), finance (determining loan interest rates, etc.), where machine learning systems could harm marginalized demographic groups by displaying bias. The most prominent characteristic of LLMs is their instruction-following capabilities and their ability to generate coherent text. Instructions are provided to LLMs in plain-text in the form of system prompts, and can be used to “jailbreak” content policy restrictions that may have been put in place by the model trainers. In this work, we present a new benchmark that is meant to assess the trustworthiness of LLMs with a specific emphasis on stereotype bias and toxicity. The benchmark covers stereotypes against 12 demographic groups varying across 7 different demographic factors: race/ethnicity (Asians, Black people, etc.), gender/sexual orientation (men, women, etc.), nationality (Mexicans, Americans, etc.), age (old and young people), religion (Muslims, Chris- tians, etc.), disability (physically-disabled and able-bodied people), and socioeconomic status (poor and rich people). It also extends existing toxicity benchmarks by including adversarial prompts that enable a comprehensive assessment of LLM toxicity under conversational-style settings. Using our benchmark, we find that it is trivial to overcome the restrictions set in place by model trainers and to induce LLMs to produce harmful outputs. GPT-3.5, GPT-4 and open-source models such as LLaMA can be “tricked” into outputting toxic content by including adversarial instructions in the system prompt. With this benchmark we aim to assist model trainers and application developers in testing the robustness of LLMs against harmful content generation before deployment in consumer-facing applications.
- Graduation Semester
- 2024-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2024 Ritik Dutta
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at IllinoisManage Files
Loading…
Edit Collection Membership
Loading…
Edit Metadata
Loading…
Edit Properties
Loading…
Embargoes
Loading…