Explaining News Bias Detection: A Comparative SHAP Analysis of Transformer Model Decision Mechanisms

ACL ARR 2026 January Submission 2839 Authors

03 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: news bias detection; model interpretability; SHAP analysis; transformer models; linguistic bias
Abstract: Automated bias detection in news text is widely used to support journalistic analysis and media accountability, yet little is known about how bias detection models arrive at their decisions or why they fail. In this work, we present a comparative interpretability study of two transformer-based bias detection models: a bias-detector model and a domain-adapted Robustly optimized BERT pretraining approach (RoBERTa) model (DA-RoBERTa-BABE-FT), both fine-tuned on the Bias Annotations By Experts (BABE) dataset, using SHapley Additive exPlanations (SHAP)-based explanations. We analyze word-level attributions across correct and incorrect predictions to characterize how different model architectures operationalize linguistic bias. Our results show that although both models attend to similar categories of evaluative language, they differ substantially in how these signals are integrated into predictions. The bias-detector model assigns stronger internal evidence to false positives than to true positives, indicating a misalignment between attribution strength and prediction correctness and contributing to systematic over-flagging of neutral journalistic content. In contrast, DA-RoBERTa-BABE-FT exhibits attribution patterns that better align with prediction outcomes and produces 63% fewer false positives. We further demonstrate that model errors arise from distinct linguistic mechanisms, with false positives driven by discourse-level ambiguity rather than explicit bias cues. These findings highlight the importance of interpretability-aware evaluation for bias detection systems and suggest that architectural and training choices critically affect both model reliability and deployment suitability in journalistic contexts.
Paper Type: Long
Research Area: Sentiment Analysis, Stylistic Analysis, and Argument Mining
Research Area Keywords: feature attribution, data shortcuts/artifacts, explanation faithfulness, robustness, hardness of samples
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 2839