Studying the Role of Input-Neighbor Overlap in Retrieval-Augmented Language Models Training Efficiency
Abstract: Retrieval-augmented language models have demonstrated performance comparable to much larger models while requiring fewer computational resources. The effectiveness of these models crucially depends on the overlap between the query and the retrieved context, but the optimal degree of this overlap remains unexplored. In this paper, we systematically investigate how varying levels of query–context overlap affect model performance during both training and inference. Our experiments reveal that increased overlap initially has minimal effect, but above a critical threshold it substantially improves test-time perplexity and accelerates model learning. Building on these findings, we demonstrate that deliberately increasing overlap through synthetic context, generated by paraphrasing queries, can enhance data efficiency and reduce training time by approximately 40% without compromising performance. We validate our perplexity-based findings on question-answering tasks, confirming that the benefits of retrieval-augmented language modeling extend to practical applications. Our results provide empirical evidence of significant optimization potential for retrieval mechanisms in language model pretraining.
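The abstract does not specify how query–context overlap is quantified. A minimal sketch of one plausible surface-level measure, Jaccard overlap over token sets, is shown below; the function name and tokenization choice are illustrative assumptions, not the paper's actual metric.

```python
def token_overlap(query: str, context: str) -> float:
    """Illustrative overlap measure: Jaccard similarity between the
    token sets of a query and a retrieved context. This is an assumed
    stand-in, not the metric used in the paper."""
    q_tokens = set(query.lower().split())
    c_tokens = set(context.lower().split())
    if not q_tokens or not c_tokens:
        return 0.0
    return len(q_tokens & c_tokens) / len(q_tokens | c_tokens)

# Toy example: a query and a hypothetical retrieved neighbor.
query = "retrieval augmented language models"
context = "retrieval augmented models scale well"
score = token_overlap(query, context)  # shared: 3 tokens, union: 6 tokens
```

A paraphrase of the query used as synthetic context would, by construction, push this score toward its upper end, which is the intervention the abstract describes.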
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: retrieval-augmented language models, rag, retro, efficient, low-resource
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 2664