Giving Control Back to Models: Enabling Offensive Language Detection Models to Autonomously Identify and Mitigate Biases
Abstract: The rapid development of social media has led to an increase in online harassment and offensive speech, posing significant challenges for effective content moderation. Existing automated detection models often exhibit a bias toward predicting offensiveness from specific vocabulary, which not only compromises model fairness but can also exacerbate biases against vulnerable and minority groups. To address these issues, this paper proposes a bias self-awareness and data self-iteration framework for mitigating model biases. The framework aims to "give control back to models," enabling offensive language detection models to autonomously identify and mitigate biases through a bias self-awareness algorithm and a self-iterative data augmentation method. Experimental results demonstrate that the proposed framework effectively reduces models' false positive rates in both in-distribution and out-of-distribution tests, improves model accuracy and fairness, and shows promising performance gains in detecting offensive speech on larger-scale datasets.
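The abstract describes vocabulary-driven false positives as the core symptom of model bias. As a minimal sketch (not the authors' algorithm), the snippet below illustrates one way such lexical bias could be quantified: comparing a detector's false positive rate on benign texts that contain suspected trigger terms against benign texts that do not. The `predict` callable, `benign_texts`, and `trigger_terms` are hypothetical placeholders.

```python
# Minimal sketch (assumed setup, not the paper's bias self-awareness method):
# measure whether false positives concentrate on specific trigger vocabulary.
from typing import Callable, List, Tuple

def lexical_bias_report(
    predict: Callable[[str], int],   # hypothetical classifier: 1 = offensive, 0 = not
    benign_texts: List[str],         # texts annotated as non-offensive
    trigger_terms: List[str],        # vocabulary suspected of driving spurious predictions
) -> Tuple[float, float]:
    """Return false positive rates on benign texts with / without trigger terms."""
    with_trigger = [t for t in benign_texts if any(w in t for w in trigger_terms)]
    without_trigger = [t for t in benign_texts if not any(w in t for w in trigger_terms)]

    def fpr(texts: List[str]) -> float:
        # Fraction of benign texts the model flags as offensive.
        if not texts:
            return 0.0
        return sum(predict(t) == 1 for t in texts) / len(texts)

    return fpr(with_trigger), fpr(without_trigger)

# A large gap between the two rates suggests the model relies on surface
# vocabulary (a spurious artifact) rather than on actual offensiveness.
```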
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias correction, offensive language detection, bias self-awareness, spurious artifacts
Contribution Types: NLP engineering experiment
Languages Studied: Chinese
Submission Number: 4926