HawkBench: Investigating Resilience of RAG Methods on Stratified Information-Seeking Tasks

Published: 18 Sept 2025 · Last Modified: 30 Oct 2025 · NeurIPS 2025 Datasets and Benchmarks Track (Spotlight) · License: CC BY 4.0
Keywords: retrieval-augmented generation, benchmark, knowledge-intensive task
TL;DR: HawkBench is a human-labeled, multi-domain benchmark with 1,600 samples for evaluating RAG systems on diverse queries; the evaluation reveals limits in RAG generalizability and the need for adaptive strategies.
Abstract: In real-world information-seeking scenarios, users have dynamic and diverse needs, requiring RAG systems to demonstrate adaptable resilience. To comprehensively evaluate the resilience of current RAG methods, we introduce HawkBench, a human-labeled, multi-domain benchmark designed to rigorously assess RAG performance across categorized task types. By stratifying tasks based on information-seeking behaviors, HawkBench provides a systematic evaluation of how well RAG systems adapt to diverse user needs. Unlike existing benchmarks, which focus primarily on specific task types (mostly factoid queries) and rely on varying knowledge bases, HawkBench offers: (1) systematic task stratification to cover a broad range of query types, including both factoid and rationale queries, (2) integration of multi-domain corpora across all task types to mitigate corpus bias, and (3) rigorous annotation for high-quality evaluation. HawkBench includes 1,600 high-quality test samples, evenly distributed across domains and task types. Using this benchmark, we evaluate representative RAG methods, analyzing their performance in terms of answer quality and response latency. Our findings highlight the need for dynamic task strategies that integrate decision-making, query interpretation, and global knowledge understanding to improve RAG generalizability. We believe HawkBench serves as a pivotal benchmark for advancing the resilience of RAG methods and their ability to achieve general-purpose information seeking.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/TommyChien/HawkBench
Code URL: https://github.com/qhjqhj00/HawkBench
Supplementary Material: zip
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Flagged For Ethics Review: true
Submission Number: 2259
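
The benchmark is hosted at the Dataset URL above. Below is a minimal sketch of loading it with the Hugging Face `datasets` library and tallying samples per stratum; the split name and the field names (`domain`, `task_type`) are assumptions for illustration, not confirmed by this page — consult the dataset card for the actual schema.

```python
# Minimal sketch: load HawkBench from the Hugging Face Hub and count
# samples per stratum. The split name and field names are assumptions;
# check the dataset card for the real schema.
from collections import Counter

from datasets import load_dataset

# Repo ID taken from the Dataset URL above; "test" split is assumed.
dataset = load_dataset("TommyChien/HawkBench", split="test")

# Tally samples by (hypothetical) domain and task-type fields to check
# the even distribution across domains and task types described in the abstract.
counts = Counter((row.get("domain"), row.get("task_type")) for row in dataset)
for (domain, task_type), n in sorted(counts.items()):
    print(f"{domain} / {task_type}: {n} samples")
```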