Keywords: Inference-time Intervention, Representation Analysis, Linear Probe
Abstract: Large Language Models (LLMs) often fail on multiple-choice questions (MCQs) despite demonstrating correct knowledge in other contexts, such as free-form generation. To investigate the mechanism underlying this knowledge-prediction gap and alleviate it, we conduct a probing analysis on binary-choice questions and find that residual streams in certain layers contain a subspace spanned by two important bases: a knowledge basis that encodes the probability of the ground-truth answer and a prediction basis that encodes the probability of the answer choice predicted by the model.
We observe that incorrect predictions arise from a misalignment of the model's hidden states along these two bases.
Hence, we introduce KAPPA (Knowledge-Aligned Prediction through Projection-based Adjustment), an inference-time intervention that transforms hidden states to align the prediction coordinate with the knowledge coordinate.
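The geometric idea behind this adjustment can be sketched in a few lines. The following is a hypothetical illustration, not the paper's implementation: it assumes the knowledge and prediction bases are orthogonal unit vectors recovered by linear probes, and simply replaces the hidden state's coordinate along the prediction basis with its coordinate along the knowledge basis.

```python
import numpy as np

def kappa_adjust(h, k_dir, p_dir):
    """Hypothetical sketch of a KAPPA-style projection-based adjustment.

    h      : hidden state vector at a chosen layer
    k_dir  : unit vector of the probed 'knowledge' basis (assumed)
    p_dir  : unit vector of the probed 'prediction' basis (assumed)

    Overwrites the coordinate along the prediction basis with the
    coordinate along the knowledge basis, leaving the rest of h intact.
    """
    k_coord = h @ k_dir  # coordinate encoding the ground-truth answer
    p_coord = h @ p_dir  # coordinate encoding the model's prediction
    return h - p_coord * p_dir + k_coord * p_dir
```

Under these assumptions, the adjusted state satisfies `h' @ p_dir == h @ k_dir`, i.e. the prediction coordinate now matches the knowledge coordinate.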
Experiments on binary-choice reformulations of Big-Bench-Hard show that KAPPA substantially improves accuracy and consistently outperforms baselines.
KAPPA's benefits further extend to general MCQs, effectively mitigating the knowledge-prediction gap.
Our work provides a new geometric understanding of the knowledge-prediction gap and offers a practical method for better aligning model behavior with its latent knowledge.
Primary Area: interpretability and explainable AI
Submission Number: 6660