Abstract: Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but often reflect harmful biases and toxic behavior, risking harm to marginalized communities and undermining trust in these systems. Existing mitigation methods, from pre- to post-processing, struggle with scalability, efficiency, and adaptability. To address these challenges, we present StarDTox, an agent-based framework that iteratively refines LLM outputs using task-specific feedback. Operating primarily as a post-processing solution with intra-processing elements, StarDTox reduces bias and toxicity without requiring access to model weights or fine-tuning. Evaluations on sentence completion and multiple-choice tasks demonstrate significant reductions in representational and allocational harms while maintaining efficiency and adaptability.
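The sketch below illustrates, in broad strokes, the kind of feedback-driven post-processing loop the abstract describes: a black-box LLM output is scored by a task-specific critic and iteratively revised, with no access to model weights or fine-tuning. The helpers `generate`, `score_toxicity`, and `revise` are hypothetical stand-ins, not the paper's actual components.

```python
# Minimal sketch (assumptions noted in the lead-in): iteratively refine an LLM
# output using only an external, task-specific feedback signal, stopping when
# the output is acceptable or the revision budget is exhausted.

from typing import Callable

def detoxify_loop(
    prompt: str,
    generate: Callable[[str], str],            # black-box LLM completion call
    score_toxicity: Callable[[str], float],    # task-specific feedback in [0, 1]
    revise: Callable[[str, str, float], str],  # feedback-conditioned rewrite call
    threshold: float = 0.2,
    max_rounds: int = 3,
) -> str:
    """Iteratively refine an LLM output using external feedback only."""
    output = generate(prompt)
    for _ in range(max_rounds):
        score = score_toxicity(output)
        if score < threshold:   # acceptable output: stop early
            break
        output = revise(prompt, output, score)  # ask for a revised completion
    return output
```

Because the loop treats the generator and critic as opaque callables, it can be swapped onto different tasks (e.g., sentence completion or multiple-choice answering) by changing only the feedback function.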
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Bias Mitigation,
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7613