Keywords: Agentic AI, Small Language Model, Model Context Protocol, Tokenomics, Ensemble, IFEval, Beyond Leaderboard Evaluation
TL;DR: Beyond-leaderboard multi-dimensional evaluation of agentic ensembles of small language models.
Abstract: As large language models (LLMs) move from standalone assistants into agentic workflows, evaluation must extend beyond scalar leaderboard accuracy to account for operational reliability, cost, latency, and token efficiency. We use an agentic ensemble of small language models (SLMs) with an SLM-judge-mediated feedback loop as a case study for such beyond-leaderboard evaluation. On the 541-prompt IFEval benchmark, the best ensemble achieves 97.34% strict prompt accuracy, exceeding the strongest standalone LLM baseline, gpt-5.4, by 5.81 percentage points while operating in a lower-cost regime. We then analyze the tokenomics and operational behavior behind this gain, including cost per sample, token composition, useful-output goodput, feedback-loop recovery, latency decomposition, and performance across instruction categories and constraint counts. Our results show that agentic SLM ensembles can trade additional test-time tokens and orchestration overhead for improved instruction-following fidelity, motivating multi-dimensional evaluation protocols for future agentic AI systems.
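
The judge-mediated feedback loop and the strict prompt accuracy metric named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `run_feedback_loop`, `LoopResult`, `generate`, `judge`, and `max_rounds` are hypothetical, the retry budget of 3 is an assumed default, and only generator tokens are counted (a fuller accounting would presumably also include judge tokens). The accuracy function follows the usual IFEval convention that a prompt counts as correct only when every one of its instructions is satisfied.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical callables (assumptions, not taken from the paper):
#   generate(prompt, feedback) -> (response, tokens_used)
#   judge(prompt, response)    -> (accepted, feedback_text)
Generate = Callable[[str, Optional[str]], Tuple[str, int]]
Judge = Callable[[str, str], Tuple[bool, str]]


@dataclass
class LoopResult:
    response: str       # last response produced
    accepted: bool      # whether the judge accepted it
    rounds: int         # number of generation rounds used
    total_tokens: int   # generator tokens only (judge tokens not counted here)


def run_feedback_loop(prompt: str, generate: Generate, judge: Judge,
                      max_rounds: int = 3) -> LoopResult:
    """Generate, let the judge check, and retry with its feedback if rejected."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: Optional[str] = None
    total_tokens = 0
    response = ""
    for round_idx in range(1, max_rounds + 1):
        response, tokens = generate(prompt, feedback)
        total_tokens += tokens
        accepted, feedback = judge(prompt, response)
        if accepted:
            return LoopResult(response, True, round_idx, total_tokens)
    # Retry budget exhausted: return the last attempt, marked as not accepted.
    return LoopResult(response, False, max_rounds, total_tokens)


def strict_prompt_accuracy(per_prompt_checks: List[List[bool]]) -> float:
    """Percentage of prompts whose instructions were ALL followed."""
    if not per_prompt_checks:
        raise ValueError("need at least one prompt")
    passed = sum(1 for checks in per_prompt_checks if all(checks))
    return 100.0 * passed / len(per_prompt_checks)
```

In this sketch, each extra round adds to `total_tokens`, which is the trade-off the abstract describes: more test-time tokens spent in exchange for a higher chance that the final response satisfies every constraint.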
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 15