Internal Activations Reveal LLMs’ Context Faith: A Linear Direction for Context Distrust Control

ACL ARR 2026 January Submission 8222 Authors

06 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Context Distrust, Difference-in-means, Internal Activations
Abstract: Retrieval-Augmented Generation (RAG) has been widely adopted to enhance LLMs' factuality and performance. However, the mechanisms behind whether LLMs have faith in the context remain poorly understood. In this work, we show that LLMs' internal activations encode explicit awareness about context relevance, and this awareness can be extracted and manipulated to control context utilization. We find that context distrust can be mediated by a single direction. This direction can intervene in the model to reduce context faith during generation. We demonstrate the effectiveness of this direction on the FaithEval benchmark, showing substantial improvements on all tasks across three open-source chat models. Through analysis of attention patterns and generation uncertainty, we reveal how the context distrust direction affects the model's information processing, including reduced attention to context tokens and increased generation entropy. Based on these findings, we propose AACR, a method that leverages both internal activation-based context confidence and verbalized parametric knowledge confidence to dynamically route between external context and internal knowledge, achieving improved robustness on noisy retrieval scenarios while maintaining performance on relevant context.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Generation, Language Modeling, Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 8222