Abstract: Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but often reflect harmful biases and toxic behavior, risking harm to marginalized communities and undermining trust in these systems. Existing mitigation methods, from pre- to post-processing, struggle with scalability, efficiency, and adaptability. To address these challenges, we present StarDTox, an agent-based framework that iteratively refines LLM outputs using task-specific feedback. Operating primarily as a post-processing solution with intra-processing elements, StarDTox reduces bias and toxicity without requiring access to model weights or fine-tuning. Evaluations on sentence completion and multiple-choice tasks demonstrate significant reductions in representational and allocational harms while maintaining efficiency and adaptability.
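The sketch below illustrates, in broad strokes, the kind of feedback-driven post-processing loop the abstract describes: a black-box LLM output is scored by a task-specific critic and iteratively revised, with no access to model weights or fine-tuning. The helpers `generate`, `score_toxicity`, and `revise` are hypothetical stand-ins, not the paper's actual components.

```python
# Minimal sketch (assumptions noted in the lead-in): iteratively refine an LLM
# output using only an external, task-specific feedback signal, stopping when
# the output is acceptable or the revision budget is exhausted.

from typing import Callable

def detoxify_loop(
    prompt: str,
    generate: Callable[[str], str],            # black-box LLM completion call
    score_toxicity: Callable[[str], float],    # task-specific feedback in [0, 1]
    revise: Callable[[str, str, float], str],  # feedback-conditioned rewrite call
    threshold: float = 0.2,
    max_rounds: int = 3,
) -> str:
    """Iteratively refine an LLM output using external feedback only."""
    output = generate(prompt)
    for _ in range(max_rounds):
        score = score_toxicity(output)
        if score < threshold:   # acceptable output: stop early
            break
        output = revise(prompt, output, score)  # ask for a revised completion
    return output
```

Because the loop treats the generator and critic as opaque callables, it can be swapped onto different tasks (e.g., sentence completion or multiple-choice answering) by changing only the feedback function.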
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Bias Mitigation,
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 7613