Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

16 Sept 2025 (modified: 18 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Influence Functions, Sparse Autoencoder, Interpretability, LLM
Abstract: A critical step for reliable large language model (LLM) use in healthcare is to attribute predictions to their training data, akin to a medical case study. This requires token-level precision: pinpointing not just which training examples influence a decision, but which tokens within them are responsible. While influence functions offer a principled framework for this, prior work is restricted to autoregressive settings and relies on an implicit assumption of token independence, rendering the identified influences unreliable. We introduce a flexible framework that infers token-level influence through a latent mediation approach for general prediction tasks. Our method attaches sparse autoencoders to any layer of a pretrained LLM to learn a basis of approximately independent latent features. Unlike prior methods, where influence decomposes additively across tokens, influence computed over latent features is inherently non-decomposable. To address this, we introduce a novel method based on Jacobian-vector products. Token-level influence is obtained by propagating latent attributions back to the input space via token activation patterns. We scale our approach using efficient inverse-Hessian approximations. Experiments on medical benchmarks show that our approach identifies sparse, interpretable sets of tokens that jointly influence predictions. Our framework enhances trust and enables model auditing, generalizing to any high-stakes domain requiring transparent and accountable decisions.
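
The sketch below is a toy illustration of the pipeline the abstract describes: encode per-token hidden states with a sparse autoencoder, attribute the prediction to latent features, and propagate those latent attributions back to tokens via their activation patterns. Everything here is an assumption made for illustration only: the toy dimensions, the `sae_encode` and `predict` functions, the mean pooling, and the identity shortcut standing in for the paper's inverse-Hessian approximation are not the authors' implementation.

```python
# Minimal, hedged sketch (PyTorch) of latent-mediated token influence.
# Assumed toy components; not the paper's actual code.
import torch

torch.manual_seed(0)
d_model, d_latent, n_tokens = 16, 32, 5

W_enc = 0.1 * torch.randn(d_latent, d_model)   # toy SAE encoder weights (frozen)
readout = 0.1 * torch.randn(d_latent)          # toy prediction head over latents

def sae_encode(hidden):
    # Map per-token hidden states to sparse, approximately independent latents.
    return torch.relu(hidden @ W_enc.T)

def predict(hidden):
    # Toy scalar prediction that depends on the pooled latent features.
    return sae_encode(hidden).mean(dim=0) @ readout

hidden = torch.randn(n_tokens, d_model)        # stand-in for one example's hidden states

# Jacobian-vector product: how a perturbation of the hidden states (the tangent)
# flows through the SAE latents into the prediction. This only illustrates the
# JVP machinery mentioned in the abstract, applied here to a random direction.
tangent = torch.randn_like(hidden)
_, jvp_out = torch.func.jvp(predict, (hidden,), (tangent,))

# Latent-level attribution: gradient of the prediction w.r.t. the pooled latents,
# with the inverse Hessian approximated by the identity for brevity.
latents = sae_encode(hidden)                                   # (n_tokens, d_latent)
grad_latent = torch.func.grad(lambda z: z @ readout)(latents.mean(dim=0))
latent_attr = grad_latent                                      # H^{-1} ~ I in this sketch

# Token-level influence: propagate latent attributions back to tokens via each
# token's latent activation pattern.
token_scores = latents @ latent_attr                           # (n_tokens,)
print("JVP along toy direction:", float(jvp_out))
print("Per-token influence scores:", token_scores.tolist())
```

In this toy version the per-token score is simply each token's latent activation pattern weighted by the latent-level attribution; the paper's method replaces the identity preconditioner with efficient inverse-Hessian approximations and computes attributions for a real prediction loss.
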
Primary Area: interpretability and explainable AI
Submission Number: 6865