Search Arena: Analyzing Search-Augmented LLMs

ICLR 2026 Conference Submission 20098 Authors

19 Sept 2025 (modified: 08 Oct 2025) · CC BY 4.0
Keywords: Large Language Models, Web Search, Human-AI Interaction
TL;DR: We introduce the first large-scale human-preference dataset of user interactions with search-augmented LLMs and provide in-depth analyses.
Abstract: Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce \textbf{Search Arena}, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations and the types of cited sources, even when the cited content does not directly support the associated claims, uncovering a gap between perceived and actual credibility. To assess cross-setting performance, we conduct cross-arena analyses, testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-heavy settings. We find that web search does not degrade, and may even improve, performance in non-search settings; however, response quality in search-heavy settings degrades significantly when models rely solely on parametric knowledge. We open-source the dataset to support future research.
Primary Area: datasets and benchmarks
Submission Number: 20098