Studying the Role of Input-Neighbor Overlap in Retrieval-Augmented Language Models Training Efficiency
Abstract: Retrieval-augmented language models have demonstrated performance comparable to much larger models while requiring fewer computational resources. The effectiveness of these models crucially depends on the overlap between the query and the retrieved context, but the optimal degree of this overlap remains unexplored. In this paper, we systematically investigate how varying levels of query–context overlap affect model performance during both training and inference. Our experiments reveal that increased overlap initially has minimal effect, but above a critical threshold it substantially improves test-time perplexity and accelerates model learning. Building on these findings, we demonstrate that deliberately increasing overlap through synthetic context, generated by paraphrasing queries, can enhance data efficiency and reduce training time by approximately 40% without compromising performance. We validate our perplexity-based findings on question-answering tasks, confirming that the benefits of retrieval-augmented language modeling extend to practical applications. Our results provide empirical evidence of significant optimization potential for retrieval mechanisms in language model pretraining.
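The abstract does not specify how query–context overlap is quantified. A minimal sketch of one plausible surface-level measure, Jaccard overlap over token sets, is shown below; the function name and tokenization choice are illustrative assumptions, not the paper's actual metric.

```python
def token_overlap(query: str, context: str) -> float:
    """Illustrative overlap measure: Jaccard similarity between the
    token sets of a query and a retrieved context. This is an assumed
    stand-in, not the metric used in the paper."""
    q_tokens = set(query.lower().split())
    c_tokens = set(context.lower().split())
    if not q_tokens or not c_tokens:
        return 0.0
    return len(q_tokens & c_tokens) / len(q_tokens | c_tokens)

# Toy example: a query and a hypothetical retrieved neighbor.
query = "retrieval augmented language models"
context = "retrieval augmented models scale well"
score = token_overlap(query, context)  # shared: 3 tokens, union: 6 tokens
```

A paraphrase of the query used as synthetic context would, by construction, push this score toward its upper end, which is the intervention the abstract describes.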
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: retrieval-augmented language models, rag, retro, efficient, low-resource
Contribution Types: Approaches to low-resource settings
Languages Studied: English
Submission Number: 2664