Keywords: Agentic Search, Evaluation, Large Language Models
Abstract: Agentic search, as a more autonomous and adaptive paradigm of retrieval augmentation, is driving the evolution of intelligent search systems. However, existing evaluation frameworks fail to align well with the real goals of agentic search. First, the complex queries with short-form answers used in existing evaluations often deviate from realistic user search scenarios. Second, most evaluations focus solely on end-to-end performance, neglecting assessment of the iterative process inherent to agentic search. To address these limitations, we propose RAVine---a Reality-Aligned eValuation framework for agentic LLMs with search. RAVine targets real user queries that require multi-faceted search and long-form answers, and introduces an attributable nugget construction strategy to improve the precision and consistency of long-form evaluation. Moreover, RAVine examines models with process-oriented metrics, including search tool performance and efficiency. We benchmark a series of models using RAVine and derive several insights, which we hope will contribute to advancing the development of agentic search systems.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, evaluation, metrics
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 8465