Abstract: The presence of bias in Large Language Models poses a major obstacle to trustworthy AI, as it heightens the risk of adversarial attacks and misuse in real-world scenarios. However, existing debiasing methods often suffer from low efficiency, lack theoretical guarantees of effectiveness, or compromise the model’s core capabilities. To address these challenges, we propose BIAS DIFF (Bias Data Attribution with Influence Function), a novel debiasing framework based on model interpretability. BIAS DIFF first identifies biased training data using influence functions and then applies targeted debiasing strategies tailored to different settings. Experiments on Qwen2.5-1.5B-Instruct and OPT-1.3B show that our method extracts over 99.5\% of the biased samples using only 35\% of the training data. It also achieves at least a 28\% reduction in bias on the CrowS-Pairs test set. Our code is publicly available at \href{https://anonymous.4open.science/r/parhelic-tmo/}{https://anonymous.4open.science/r/parhelic-tmo/}.
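For readers unfamiliar with influence-based data attribution, the sketch below illustrates the general idea of scoring training examples by how much their loss gradients align with the gradient of a bias probe. This is a minimal first-order (TracIn-style) approximation, not the paper's exact BIAS DIFF formulation (full influence functions also involve an inverse-Hessian term); `loss_fn`, `train_examples`, and `bias_probe_batch` are hypothetical placeholders.

```python
import torch

def attribution_scores(model, loss_fn, train_examples, bias_probe_batch):
    """Score each training example by the inner product of its loss
    gradient with the gradient of a bias-probe loss. Higher scores
    suggest the example contributes more to the probed biased behavior.
    First-order sketch only; a full influence function would additionally
    apply an inverse-Hessian-vector product to the probe gradient."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the loss on a probe set of known biased completions.
    probe_loss = loss_fn(model, bias_probe_batch)
    probe_grads = torch.autograd.grad(probe_loss, params)

    scores = []
    for example in train_examples:
        ex_loss = loss_fn(model, example)
        ex_grads = torch.autograd.grad(ex_loss, params)
        # Gradient alignment: large positive values flag candidate
        # biased samples for inspection or removal.
        score = sum((g * pg).sum() for g, pg in zip(ex_grads, probe_grads))
        scores.append(score.item())
    return scores
```

Ranking the training set by such scores and inspecting only the top fraction is one way a method could surface most biased samples while examining a minority of the data, consistent with the efficiency claim above.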
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/unfairness mitigation, data influence
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3297