STARDTOX: Is Fairness in Language Models Just a Few Prompts Away?

ACL ARR 2025 May Submission 4452 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) often produce outputs that reflect social biases, toxicity, or unfair treatment of demographic groups, undermining trust and fairness. While prior mitigation strategies frequently rely on complex architectures, access to model internals, or costly fine-tuning, we argue that simplicity can be a strength. We introduce STARDTOX, a lightweight critique-and-revise multi-agent framework that leverages the LLM’s own internal knowledge, via a small number of coordinated prompts, to self-correct harmful outputs. Dedicated agents independently assess bias and overall output quality, and their feedback is integrated to guide prompt-based revision. Without modifying model weights or requiring any fine-tuning, STARDTOX delivers strong bias mitigation and high-quality outputs across both open-ended text generation and structured tasks, outperforming prior baselines. On the RealToxicityPrompts dataset for open-ended generation, it reduces toxicity by over 50% relative to the baselines while maintaining over 90% fluency. On the structured BBQ benchmark, it achieves the lowest bias scores on both ambiguous and disambiguated examples without sacrificing accuracy.
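For a concrete picture of the critique-and-revise loop the abstract describes, the following is a minimal sketch assuming a generic text-completion callable (`llm`); the agent prompts, the two-critic split, and the revision instruction are illustrative assumptions, not the authors' actual prompts or aggregation logic.

```python
from typing import Callable

def critique_and_revise(llm: Callable[[str], str], prompt: str, rounds: int = 1) -> str:
    """Hypothetical critique-and-revise loop in the spirit of STARDTOX.

    `llm` stands in for any text-completion call; the prompts below are
    illustrative placeholders, not the paper's actual prompts.
    """
    draft = llm(prompt)
    for _ in range(rounds):
        # Critic agent 1: flags social bias, stereotyping, or toxicity in the draft.
        bias_feedback = llm(
            "Identify any social bias, stereotyping, or toxic language in the "
            f"following text and explain briefly:\n{draft}"
        )
        # Critic agent 2: assesses overall output quality (fluency, relevance).
        quality_feedback = llm(
            "Assess the fluency and relevance of this text to the prompt "
            f"'{prompt}':\n{draft}"
        )
        # Reviser: integrates both critiques into a corrected output via a new prompt.
        draft = llm(
            "Revise the text below so it is free of bias and toxicity while "
            "staying fluent and on-topic.\n"
            f"Text: {draft}\n"
            f"Bias critique: {bias_feedback}\n"
            f"Quality critique: {quality_feedback}"
        )
    return draft
```

No model weights are touched in this sketch: all correction happens through a handful of coordinated prompts, which is the property the abstract emphasizes.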
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: NLP, Fairness, Agent
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Keywords: LLM, Fairness, Agents
Submission Number: 4452