Keywords: Language Agent; Search Agent; DeepResearch
Abstract: Language agents are increasingly deployed for web search, yet most benchmarks assume queries are fully specified and unambiguous. In practice, user queries are often incomplete and require clarification before accurate answers can be produced. To systematically evaluate this overlooked capability, we introduce InteractComp, a benchmark explicitly designed to evaluate whether agents can recognize and resolve such ambiguity by deciding when to search, when to ask clarifying questions, and when to answer. InteractComp contains 210 expert-curated questions spanning 9 domains, constructed through a systematic target-distractor methodology that ensures genuine ambiguity and controlled disambiguation. Extensive experiments on 17 models reveal striking behavioral patterns: even state-of-the-art models achieve less than 14% accuracy, not because they lack reasoning ability, but because they exhibit systematic overconfidence and underutilize interaction opportunities. Ablation and forced-interaction analyses confirm this bottleneck: when compelled to interact, models achieve significant performance gains, demonstrating latent capacity that current strategies fail to unlock. A longitudinal study further highlights a blind spot in model development: while retrieval benchmarks show rapid improvement, interactive capabilities remain stagnant. By exposing this overlooked weakness, InteractComp provides not only a diagnostic tool but also a foundation for designing agents that are uncertainty-aware, strategically interactive, and aligned with real-world user behavior.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 2826