Keywords: Reasoning language models, Deep research, Agentic systems, Augmented language models
TL;DR: We study the performance of reasoning models on deep research tasks and find that web search can dominate end-to-end latency.
Abstract: Reasoning models augmented with external tools have demonstrated strong abilities in solving complex tasks, yet despite rapid gains in accuracy, the latency of reasoning and deep-research systems has been largely overlooked. We present the first systematic temporal study of three representative reasoning systems (OpenAI o3, GPT-5, and the LangChain Deep Research Agent) on DeepResearch Bench. By instrumenting each system, we decompose end-to-end request latency and token costs across reasoning, web search, and answer generation. We find that web search often dominates end-to-end request latency and that final answer generation consumes most tokens due to the lengthy retrieved context, implying that tool latency and retrieval design are primary levers for speeding up deep-research systems.
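The latency decomposition described in the abstract can be illustrated with a minimal sketch. The phase names, timer helper, and stand-in `time.sleep` calls below are hypothetical; the paper's actual instrumentation of o3, GPT-5, and the LangChain agent is not specified here.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock time per phase of a deep-research request,
# mirroring the decomposition into reasoning, web search, and answer
# generation. (Hypothetical instrumentation, not the paper's code.)
phase_latency = {}

@contextmanager
def timed(phase):
    """Time a phase with a monotonic clock and accumulate its latency."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_latency[phase] = (
            phase_latency.get(phase, 0.0) + time.perf_counter() - start
        )

# Stand-in workloads; real code would wrap model and tool calls here.
with timed("reasoning"):
    time.sleep(0.01)
with timed("web_search"):
    time.sleep(0.03)
with timed("answer_generation"):
    time.sleep(0.02)

total = sum(phase_latency.values())
# Per-phase share of end-to-end latency; in the paper's measurements,
# the web-search share is often the largest.
shares = {phase: t / total for phase, t in phase_latency.items()}
```

Summing per-phase timers to the end-to-end total gives a simple check that the instrumentation covers the whole request path.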
Submission Number: 166