Keywords: Reasoning language models, Deep research, Agentic systems, Augmented language models
TL;DR: We study the performance of reasoning models on deep research tasks and find that web search can dominate end-to-end latency.
Abstract: Reasoning models augmented with external tools have demonstrated strong abilities in solving complex tasks, yet despite rapid gains in accuracy, the latency of reasoning and deep-research systems has been largely overlooked. We present the first systematic temporal study of three representative reasoning systems (OpenAI o3, GPT-5, and the LangChain Deep Research Agent) on DeepResearch Bench. By instrumenting each system, we decompose end-to-end request latency and token costs across reasoning, web search, and answer generation. We find that web search often dominates end-to-end request latency and that final answer generation consumes most tokens due to the lengthy retrieved context, implying that tool latency and retrieval design are primary levers for speeding up deep-research systems.
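The latency decomposition described in the abstract can be illustrated with a minimal sketch. The phase names, timer helper, and stand-in `time.sleep` calls below are hypothetical; the paper's actual instrumentation of o3, GPT-5, and the LangChain agent is not specified here.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock time per phase of a deep-research request,
# mirroring the decomposition into reasoning, web search, and answer
# generation. (Hypothetical instrumentation, not the paper's code.)
phase_latency = {}

@contextmanager
def timed(phase):
    """Time a phase with a monotonic clock and accumulate its latency."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_latency[phase] = (
            phase_latency.get(phase, 0.0) + time.perf_counter() - start
        )

# Stand-in workloads; real code would wrap model and tool calls here.
with timed("reasoning"):
    time.sleep(0.01)
with timed("web_search"):
    time.sleep(0.03)
with timed("answer_generation"):
    time.sleep(0.02)

total = sum(phase_latency.values())
# Per-phase share of end-to-end latency; in the paper's measurements,
# the web-search share is often the largest.
shares = {phase: t / total for phase, t in phase_latency.items()}
```

Summing per-phase timers to the end-to-end total gives a simple check that the instrumentation covers the whole request path.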
Submission Number: 166