Beyond Leaderboards: Tokenomics of Agentic Small Language Model Ensembles

Published: 23 May 2026, Last Modified: 25 May 2026, ACM CAIS 2026: RLEval Workshop Poster, CC BY 4.0
Keywords: Agentic AI, Small Language Model, Model Context Protocol, Tokenomics, Ensemble, IFEval, Beyond Leaderboard Evaluation
TL;DR: Beyond-leaderboard multi-dimensional evaluation of agentic ensembles of small language models.
Abstract: As large language models (LLMs) move from standalone assistants into agentic workflows, evaluation must extend beyond scalar leaderboard accuracy to account for operational reliability, cost, latency, and token efficiency. We use an agentic ensemble of small language models (SLMs) with an SLM-judge-mediated feedback loop as a case study for such beyond-leaderboard evaluation. On the 541-prompt IFEval benchmark, the best ensemble achieves 97.34% strict prompt accuracy, exceeding the strongest standalone LLM baseline, gpt-5.4, by 5.81 percentage points while operating in a lower-cost regime. We then analyze the tokenomics and operational behavior behind this gain, including cost per sample, token composition, useful-output goodput, feedback-loop recovery, latency decomposition, and performance across instruction categories and constraint counts. Our results show that agentic SLM ensembles can trade additional test-time tokens and orchestration overhead for improved instruction-following fidelity, motivating multi-dimensional evaluation protocols for future agentic AI systems.
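The abstract names cost per sample and useful-output goodput among its tokenomics metrics but does not define them here. The following is a minimal sketch of how such metrics might be computed from per-sample token logs. The `SampleUsage` record, the function names, the per-million-token pricing parameters, and the definition of goodput as useful output tokens divided by all tokens processed are illustrative assumptions, not the authors' definitions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SampleUsage:
    """Hypothetical per-prompt usage record (not taken from the paper)."""
    input_tokens: int          # all prompt tokens across ensemble and judge calls
    output_tokens: int         # all generated tokens, including rejected drafts
    useful_output_tokens: int  # tokens in the final accepted answer (assumed definition)
    passed: bool               # whether the final answer met every instruction


def _require_samples(samples: Sequence[SampleUsage]) -> None:
    if not samples:
        raise ValueError("at least one sample is required")


def cost_per_sample(samples: Sequence[SampleUsage],
                    usd_per_m_input: float,
                    usd_per_m_output: float) -> float:
    """Mean USD cost per prompt, given assumed per-million-token prices."""
    _require_samples(samples)
    total_usd = sum(
        s.input_tokens * usd_per_m_input + s.output_tokens * usd_per_m_output
        for s in samples
    ) / 1_000_000
    return total_usd / len(samples)


def goodput(samples: Sequence[SampleUsage]) -> float:
    """Share of all processed tokens that ended up in accepted answers (assumed definition)."""
    _require_samples(samples)
    useful = sum(s.useful_output_tokens for s in samples)
    total = sum(s.input_tokens + s.output_tokens for s in samples)
    return useful / total if total else 0.0


def strict_prompt_accuracy(samples: Sequence[SampleUsage]) -> float:
    """Fraction of prompts whose final answer satisfied every instruction."""
    _require_samples(samples)
    return sum(1 for s in samples if s.passed) / len(samples)
```

Under these assumed definitions, a feedback loop that retries a prompt raises the token totals and therefore the cost per sample, and lowers goodput unless the retry changes the outcome. This is one way to make the accuracy-versus-overhead trade-off described in the abstract measurable.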
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 15