Can Global XAI Methods Reveal Injected Bias in LLMs? SHAP vs Rule Extraction vs RuleSHAP

ICLR 2026 Conference Submission 19227 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Rule-based interpretability, Global model explainability, SHAP-based rule induction, Misinformation drivers (framing/overload/oversimplification)
TL;DR: The paper benchmarks global rule-extraction XAI algorithms for detecting bias-inducing heuristics in LLMs and introduces RuleSHAP (a hybrid of SHAP and RuleFit) that significantly improves detection of nonlinear, belief-driven biases.
Abstract: Large language models (LLMs) can amplify misinformation, undermining societal goals like the UN SDGs. We study three documented drivers of misinformation ($\textit{valence framing}$, $\textit{information overload}$, and $\textit{oversimplification}$) which are often shaped by one's default beliefs. Building on evidence that LLMs encode such defaults (e.g., “joy is positive,” “math is complex”) and can act as “bags of heuristics,” we ask: can general belief-driven heuristics behind misinformative behaviour be recovered from LLMs as clear rules? A key obstacle is that global rule-extraction methods in explainable AI (XAI) are built for numerical inputs/outputs, not text. We address this by eliciting global LLM beliefs and mapping them to numerical scores via statistically reliable abstractions, thereby enabling off-the-shelf global XAI to detect belief-related heuristics in LLMs. To obtain ground truth, we hard-code bias-inducing nonlinear heuristics of increasing complexity (univariate, conjunctive, nonconvex) into popular LLMs (ChatGPT and Llama) via system instructions. This way, we find that $\textit{RuleFit}$ under-detects non-univariate biases, while $\textit{global SHAP}$ better approximates conjunctive ones but does not yield actionable rules. To bridge this gap, we propose $\textit{RuleSHAP}$, a rule-extraction algorithm that couples global SHAP-value aggregations with rule induction to better capture non-univariate bias, improving heuristics detection over RuleFit by +94% (MRR@1) on average. Our results provide a practical pathway for revealing belief-driven biases in LLMs.
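To make the abstract's core idea of "coupling global SHAP-value aggregations with rule induction" concrete, the snippet below is a minimal, hypothetical sketch rather than the paper's RuleSHAP implementation: it fits a surrogate model on synthetic numerical belief scores, aggregates per-feature |SHAP| values globally, and then induces readable rules with a shallow decision tree restricted to the SHAP-salient features. All data, feature names, and thresholds are placeholders.

```python
# Illustrative sketch only: couple global SHAP aggregation with rule induction.
# Synthetic data; NOT the paper's RuleSHAP algorithm.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 3))  # placeholder belief scores
feature_names = ["valence", "overload", "simplicity"]
# Hard-coded conjunctive "bias": output spikes when valence is high AND overload is low.
y = 0.2 * X[:, 2] + 1.0 * ((X[:, 0] > 0.7) & (X[:, 1] < 0.3))

# 1) Fit a surrogate model on the numerical scores.
surrogate = GradientBoostingRegressor(random_state=0).fit(X, y)

# 2) Aggregate SHAP values globally (mean |SHAP| per feature).
explainer = shap.TreeExplainer(surrogate)
shap_values = explainer.shap_values(X)
global_importance = np.abs(shap_values).mean(axis=0)
salient = global_importance > global_importance.mean()  # crude salience cut-off

# 3) Induce human-readable rules, restricted to SHAP-salient features.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X[:, salient], y)
print(export_text(tree, feature_names=[n for n, s in zip(feature_names, salient) if s]))
```

The filter-then-induce step here is only meant to convey the coupling intuition; the paper's RuleSHAP presumably integrates the global SHAP aggregations into rule induction more tightly than this simple feature filter.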
Primary Area: interpretability and explainable AI
Submission Number: 19227