Abstract: The presence of bias in Large Language Models poses a major obstacle to trustworthy AI, as it heightens the risk of adversarial attacks and misuse in real-world scenarios. However, existing debiasing methods often suffer from low efficiency, lack theoretical guarantees of effectiveness, or compromise the model’s core capabilities. To address these challenges, we propose BIAS DIFF (Bias Data Attribution with Influence Function), a novel debiasing framework based on model interpretability. BIAS DIFF first identifies biased training data using influence functions and then applies targeted debiasing strategies tailored to different settings. Experiments on Qwen2.5-1.5B-Instruct and OPT-1.3B show that our method extracts over 99.5\% of the biased samples using only 35\% of the training data. It also achieves at least a 28\% reduction in bias on the CrowS-Pairs test set. Our code is publicly available at \href{https://anonymous.4open.science/r/parhelic-tmo/}{https://anonymous.4open.science/r/parhelic-tmo/}.
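For readers unfamiliar with influence-based data attribution, the sketch below illustrates the general idea of scoring training examples by how much their loss gradients align with the gradient of a bias probe. This is a minimal first-order (TracIn-style) approximation, not the paper's exact BIAS DIFF formulation (full influence functions also involve an inverse-Hessian term); `loss_fn`, `train_examples`, and `bias_probe_batch` are hypothetical placeholders.

```python
import torch

def attribution_scores(model, loss_fn, train_examples, bias_probe_batch):
    """Score each training example by the inner product of its loss
    gradient with the gradient of a bias-probe loss. Higher scores
    suggest the example contributes more to the probed biased behavior.
    First-order sketch only; a full influence function would additionally
    apply an inverse-Hessian-vector product to the probe gradient."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the loss on a probe set of known biased completions.
    probe_loss = loss_fn(model, bias_probe_batch)
    probe_grads = torch.autograd.grad(probe_loss, params)

    scores = []
    for example in train_examples:
        ex_loss = loss_fn(model, example)
        ex_grads = torch.autograd.grad(ex_loss, params)
        # Gradient alignment: large positive values flag candidate
        # biased samples for inspection or removal.
        score = sum((g * pg).sum() for g, pg in zip(ex_grads, probe_grads))
        scores.append(score.item())
    return scores
```

Ranking the training set by such scores and inspecting only the top fraction is one way a method could surface most biased samples while examining a minority of the data, consistent with the efficiency claim above.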
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/unfairness mitigation, data influence
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 3297