CUB: Benchmarking Context Utilisation Techniques for Language Models

ACL ARR 2025 July Submission 608 Authors

28 Jul 2025 (modified: 02 Sept 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Incorporating external knowledge is crucial for knowledge-intensive tasks, such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory or be distracted by irrelevant contexts. While many context utilisation manipulation techniques (CMTs) have recently been proposed to alleviate these issues, few have been systematically compared. In this paper, we develop CUB (Context Utilisation Benchmark), the first comprehensive benchmark designed to help practitioners in retrieval-augmented generation (RAG) diagnose CMTs under different context conditions. With this benchmark, we conduct the most extensive evaluation to date of seven state-of-the-art methods, representative of the main categories of CMTs, across three diverse datasets and tasks, applied to nine LMs. Our results reveal that most existing CMTs struggle to handle the full spectrum of context types encountered in real-world retrieval-augmented scenarios. We also find that many CMTs display inflated performance on simple synthesised datasets compared to more realistic datasets with naturally occurring samples. Our findings expose critical gaps in current CMT evaluation practices and demonstrate the need for holistic testing and for the development of CMTs that can robustly handle multiple context types.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: hardness of samples, knowledge inducing, model editing
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 608