Disentangling Test-Time and Parameter Scaling for Cost-Efficient Accuracy Improvements in Agentic Evaluation
Keywords: Agentic evaluation, Large language models (LLMs), Test-time scaling, Parameter scaling, Chain-of-Thought (CoT), Internal reasoning, Cost-efficiency, Pareto frontier, Knowledge retrieval, Mathematical reasoning
TL;DR: We compare CoT and parameter scaling on accuracy, cost, and latency: CoT helps small models on math, while model capacity dominates on knowledge-intensive QA. Internal reasoning can render external CoT redundant. We report Pareto frontiers and cost-per-point metrics.
Abstract: Large language models (LLMs) offer two primary levers for improving accuracy in agentic systems: test-time scaling (e.g., Chain-of-Thought reasoning) and parameter scaling (upgrading to larger models). Despite the widespread adoption of both, the field lacks a principled evaluation of their accuracy-cost-latency trade-offs under controlled conditions. We present a comprehensive evaluation framework and conduct experiments on GSM8K (1,319 items) and PopQA (2,000-item subset) to quantify these trade-offs. Our key findings are: (i) on mathematical reasoning tasks, Chain-of-Thought is highly effective for smaller models but becomes redundant when internal reasoning capabilities are available; (ii) on knowledge-intensive QA, performance is primarily capacity-bound, with Chain-of-Thought often increasing costs without improving accuracy; (iii) for models with advanced reasoning capabilities, external Chain-of-Thought becomes largely redundant and can even harm performance while increasing costs. We formalize Pareto frontiers and cost-per-point metrics that translate into actionable deployment policies for more efficient agentic systems.
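As a rough illustration of the two metrics named in the abstract, the sketch below computes an accuracy-cost Pareto frontier and a cost-per-point figure. It is a minimal sketch under stated assumptions, not the paper's formalization: the `Config` class, the function names, the cost unit, and every number in `configs` are hypothetical placeholders, and the paper's actual definitions (for example, how latency enters) may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Config:
    """One model/prompting configuration (hypothetical structure)."""
    name: str
    accuracy: float  # percent correct on a benchmark
    cost: float      # assumed unit: dollars per 1,000 queries


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep configurations that no other configuration dominates.

    A configuration is dominated if another one costs no more AND is at
    least as accurate, with at least one of the two strictly better.
    """
    front = []
    for c in configs:
        dominated = any(
            o.cost <= c.cost
            and o.accuracy >= c.accuracy
            and (o.cost < c.cost or o.accuracy > c.accuracy)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.cost)


def cost_per_point(baseline: Config, candidate: Config) -> Optional[float]:
    """Extra cost paid per accuracy point gained over the baseline.

    Returns None when the candidate yields no accuracy gain, since the
    ratio is not meaningful in that case.
    """
    gain = candidate.accuracy - baseline.accuracy
    if gain <= 0:
        return None
    return (candidate.cost - baseline.cost) / gain


# Placeholder numbers for illustration only; NOT results from the paper.
configs = [
    Config("small", 60.0, 1.0),
    Config("small+cot", 75.0, 2.0),
    Config("large", 80.0, 5.0),
    Config("large+cot", 79.0, 8.0),
]

if __name__ == "__main__":
    for c in pareto_front(configs):
        print(c.name, c.accuracy, c.cost)
    print(cost_per_point(configs[0], configs[1]))
```

With these placeholder values, `large+cot` falls off the frontier because `large` is both cheaper and more accurate, which mirrors the kind of case the abstract describes where external Chain-of-Thought adds cost without improving accuracy.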
Supplementary Material: zip
Submission Number: 51