Abstract: The retrieval-augmented generation framework addresses the limitations of large language models by enabling real-time knowledge updates for more accurate answers. An efficient way to train retrieval-augmented models is attention distillation, which uses attention scores as supervision signals instead of manually annotated query-document pairs. Despite its growing popularity, the detailed mechanisms behind the success of attention distillation remain unexplored, particularly the specific patterns it leverages to benefit training. In this paper, we address this gap by conducting a comprehensive investigation of the attention distillation workflow and identifying key factors that influence the learning performance of retrieval-augmented language models. We further propose several insightful indicators for optimizing models' training methods and avoiding ineffective training.
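To make the supervision signal concrete, below is a minimal PyTorch sketch of the general attention-distillation idea referenced in the abstract: the retriever's relevance distribution over retrieved passages is trained to match the reader's aggregated attention over those passages. The function name, tensor shapes, and the KL-based objective are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(retriever_scores, reader_attention, temperature=1.0):
    """Hypothetical sketch: align the retriever's document distribution with
    the reader's aggregated attention over retrieved passages.

    retriever_scores: (batch, n_docs) relevance scores from the retriever.
    reader_attention: (batch, n_docs) attention mass the reader assigns to each
        retrieved passage, aggregated over heads, layers, and tokens (assumed
        to be precomputed upstream).
    """
    # Target distribution: the reader's attention, normalized over documents
    # and detached so gradients only flow into the retriever.
    target = F.softmax(reader_attention / temperature, dim=-1).detach()
    # Retriever distribution in log space, as required by F.kl_div.
    log_pred = F.log_softmax(retriever_scores / temperature, dim=-1)
    # KL divergence: the retriever learns to reproduce the reader's attention.
    return F.kl_div(log_pred, target, reduction="batchmean")

# Toy usage with random tensors standing in for real model outputs.
retriever_scores = torch.randn(2, 5)   # 2 queries, 5 retrieved passages each
reader_attention = torch.rand(2, 5)    # aggregated attention per passage
loss = attention_distillation_loss(retriever_scores, reader_attention)
print(loss.item())
```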