Are We Scaling the Right Thing? A System Perspective on Test-Time Scaling

Youpeng Zhao; Jinpeng Lv; Di Wu; Jun Wang

Published: 16 Oct 2025, Last Modified: 10 Nov 2025, Venue: NeurIPS 2025 ER Workshop, License: CC BY 4.0

Keywords: Test-time Scaling, System Evaluation, LLMs

TL;DR: We propose investigating test-time scaling (TTS) from a system perspective, in terms of latency and cost-per-token.

Abstract: Test-time scaling (TTS) has recently emerged as a promising direction for exploiting the hidden reasoning capabilities of pre-trained large language models (LLMs). However, existing scaling methods focus narrowly on the compute-optimal Pareto frontier, ignoring the simple fact that *compute-optimal is not always system-optimal*. In this work, we propose a system-driven perspective on TTS, analyzing how reasoning models scale against practical metrics such as latency and cost-per-token. By evaluating the impact of popular optimizations such as tensor parallelism and speculative decoding, our preliminary analysis reveals the limitations of current methods and calls for a paradigm shift toward holistic, system-aware evaluations that capture the true essence of scaling laws at inference time.

Submission Number: 73