Parameter vs. Test-Time Scaling in LLMs: FLOPs-Aware, Cross-Domain, Domain-Dependent, Pareto-Optimal Compute Allocation
Keywords: Parameter scaling, Test-time scaling, Chain-of-Thought, Internal reasoning, External reasoning, Cost–accuracy trade-off, Large language models, Redundancy principle, FLOPs-aware analysis, Pareto frontier
TL;DR: A cost-aware, cross-domain study comparing parameter scaling with test-time scaling in LLMs; it quantifies the redundancy between internal and external reasoning and provides cost–accuracy Pareto frontiers to guide economical, high-performance deployment.
Abstract: We study how to allocate compute between model size and test-time scaling (inference-time reasoning) to achieve cost-effective accuracy in large language models. We introduce a controllable-reasoning experimental design that directly compares parameter scaling and test-time scaling on mathematical reasoning (GSM8K) and knowledge retrieval (PopQA), using rigorous FLOPs and dollar-cost accounting and Gemini's thinking\_budget parameter to disentangle internal from external Chain-of-Thought (CoT) reasoning. Results show strong domain dependence. On GSM8K, internal reasoning alone reaches 95.36\% accuracy at $\$3.8\times10^{-5}$ per sample, while external CoT compensates for disabled internal reasoning, reaching 95.60\% at $\$9.4\times10^{-4}$ per sample and indicating near-perfect substitutability between the internal and external mechanisms. On PopQA, external CoT often reduces both accuracy and cost-efficiency, with optimal settings consistently favoring direct generation over extended reasoning chains. We contribute: (1) a redundancy principle quantifying the overlap between internal and external reasoning; (2) FLOPs-aware, domain-specific cost–accuracy Pareto frontiers that reveal distinct optimization strategies; and (3) actionable deployment policies that align test-time scaling with task characteristics and model architectures, providing evidence-based guidance for economical, high-performance LLM deployment.
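To make the cost–accuracy Pareto frontier concrete, the sketch below (illustrative only, not the authors' code) keeps the non-dominated configurations among per-configuration (cost per sample, accuracy) measurements. The two GSM8K-like points echo the numbers quoted in the abstract; the third, dominated point and all labels are hypothetical.

```python
# Minimal sketch of a cost-accuracy Pareto frontier, assuming each
# configuration is summarized as (cost_usd_per_sample, accuracy, label).
def pareto_frontier(points):
    """Return configurations not dominated by a cheaper, more accurate one."""
    frontier = []
    best_acc = float("-inf")
    # Sort by cost ascending; break ties by higher accuracy first.
    for cost, acc, label in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:  # strictly better than every cheaper configuration
            frontier.append((cost, acc, label))
            best_acc = acc
    return frontier

# Illustrative values; the first two mirror the abstract's GSM8K figures.
configs = [
    (3.8e-5, 0.9536, "internal reasoning only"),
    (9.4e-4, 0.9560, "external CoT, internal reasoning disabled"),
    (5.0e-4, 0.9100, "hypothetical dominated configuration"),
]
for cost, acc, label in pareto_frontier(configs):
    print(f"{label}: {acc:.2%} at ${cost:.1e} per sample")
```

Under this accounting, a configuration stays on the frontier only if no cheaper setting matches its accuracy, which is how the paper's domain-specific deployment recommendations are read off the frontier.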
Supplementary Material: zip
Submission Number: 194