ChatGPT Doesn’t Trust LA Chargers Fans: Guardrail Sensitivity in Context

ACL ARR 2024 June Submission 3484 Authors

16 Jun 2024 (modified: 08 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: While the biases of language models in production are extensively documented, the biases of their guardrails themselves have been neglected. This paper studies how contextual information about the user influences the likelihood that an LLM refuses to execute a request. By generating user biographies that offer ideological and demographic information, we find a number of biases in guardrail sensitivity on ChatGPT-3.5. Younger, female, White, and Asian-American personas were more likely to trigger a refusal guardrail when requesting censored or illegal information. Guardrails are also sycophantic, refusing to comply with requests for a political position the user is likely to disagree with. Furthermore, we find that certain identity groups, and even seemingly innocuous user information such as sports fandom, can elicit changes in guardrail sensitivity similar to those produced by overt political endorsement. For each demographic category, and even for declarations of National Football League (NFL) team fandom, ChatGPT appears to infer a likely political ideology and modifies its guardrail behavior accordingly.
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: model bias/fairness evaluation, bias/toxicity, conversational modeling, human-computer interaction
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 3484