Listening to the Wise Few: Query–Key Alignment Unlocks Latent Correct Answers in Large Language Models
Keywords: large language models, attention mechanisms, interpretability
TL;DR: The QK-score reveals "select-and-copy" heads whose strongest query–key match pinpoints the correct choice, boosting MCQA accuracy by up to 16 pp (60 pp on synthetic data) across diverse LLMs; the method is training-free, interpretable, and model-agnostic.
Abstract: Language models often struggle with multiple-choice question answering (MCQA) tasks: they fail to consistently choose the letter corresponding to the correct answer. We find that while certain attention heads within large language models (LLMs) identify the correct answer internally, this information can be lost before the final decision stage that determines the output. To demonstrate and measure this effect, we introduce the QK-score, a metric based on query–key vector alignment that retrieves the correct answer directly from individual attention heads. This lets us identify "select-and-copy" heads that consistently focus on the correct option during inference. Across four standard MCQA benchmarks (MMLU, CosmosQA, HellaSwag, and HaluDialogue), the QK-score from such heads can exceed the model's own output accuracy by up to 16%, especially for smaller models. On a synthetic dataset, these heads outperform the baseline by as much as 60%. We also find that QK-scores from select-and-copy heads are robust to option permutations and remain effective in few-shot settings. Analyzing a wide range of models across the LLaMA, Qwen, and other families, from 1.5B to 70B parameters, we observe the select-and-copy phenomenon in all of them. Our findings offer new insight into the inner workings of LLMs and open a principled path toward head-level interventions for controllable and trustworthy LLM reasoning.
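To make the idea concrete, here is a minimal sketch of how a QK-score-style prediction could be computed for a single attention head. This is an illustrative reconstruction, not the paper's implementation: the function name `qk_score_prediction`, the use of raw (unscaled) dot products, and the choice of the final-token query against the key vector at each option's letter token are all assumptions for the sake of the example.

```python
import numpy as np

def qk_score_prediction(q_last, option_keys):
    """Illustrative QK-score sketch for one attention head.

    q_last      : the head's query vector at the final (decision) token.
    option_keys : dict mapping an option letter (e.g. "A", "B", ...) to the
                  head's key vector at that option's token position.

    Scores each option by the query-key dot product and returns the
    argmax option together with all scores (details here are assumed,
    not taken from the paper).
    """
    scores = {opt: float(np.dot(q_last, k)) for opt, k in option_keys.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Under this sketch, a "select-and-copy" head would be one whose argmax option agrees with the ground-truth answer far more often than the model's own emitted letter does.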
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 17883