The Token Tax: Measuring the Diminishing Returns of Test-Time Compute in Agentic Pipelines

Published: 23 May 2026, Last Modified: 23 May 2026
Venue: ICML 2026 AIWILD
License: CC BY 4.0
Keywords: Autonomous Agents, Multi-Agent Systems, Data Science Agents, AI Safety, Reasoning Models, Runtime Compute, Reasoning Efficiency, Large Language Models, Target Leakage, Trustworthy AI
TL;DR: Autonomous data science agents exhibit diminishing returns from excessive test-time reasoning, and our runtime governance framework (LAS) improves integrity and reasoning efficiency while reducing token and carbon overhead.
Abstract: The assumption that increased test-time reasoning improves performance is driving the adoption of autonomous LLM agents in data science pipelines. Using the Living Agentic System (LAS), we show that this assumption fails in ML workflows due to a "Token Tax," where larger reasoning budgets sharply increase token costs while yielding only marginal utility gains. Across DeepSeek-7B, Mistral-7B, and Llama-3.1:8B, we observe a stable efficiency frontier in which low planning budgets achieve correct pipeline behavior, while additional reasoning results in planning inflation, increasing cost by over 110% with negligible quality improvement. We further demonstrate that intent-aware governance eliminates leakage-related false positives, reducing them from 74.5% to 0%, providing a practical foundation for efficient and reliable agentic data science systems.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 175