TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing

Published: 23 May 2026, Last Modified: 23 May 2026 | ACM CAIS 2026: RLEval Workshop Poster | CC BY 4.0
Keywords: LLM routing, agentic benchmark, dynamic evaluation
TL;DR: An LLM router benchmark for multi-step agentic tasks, with fast static and live dynamic evaluation
Abstract: Production LLM agents often spend most of their budget across many intermediate calls, yet existing router benchmarks mostly evaluate one-shot prompt routing without accounting for multi-step agentic tasks. We introduce TwinRouterBench, a step-level routing benchmark for choosing the cheapest sufficient model tier conditioned on the prefix visible before the next LLM call. The static track contains 970 router-visible prefixes from 520 instances across SWE-bench, BFCL, mtRAG, QMSum, and PinchBench, with execution-verified tier labels and deterministic scoring over label correctness, trajectory membership, and token cost. The dynamic track runs routers end-to-end on SWE-bench Verified and reports official resolution and realized API spend. On a 100-case held-out SWE-bench split, a logistic router trained on the static labels achieves resolution comparable to unrouted Opus 4.6 while reducing API cost by 53.1%.
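As a rough illustration of the "cheapest sufficient model tier" target and of how a relative cost reduction is computed, here is a minimal Python sketch. It is not the benchmark's implementation: the tier names, per-call costs, sufficiency flags, and both function names are hypothetical, and the abstract does not specify how labels or costs are actually represented.

```python
from typing import Mapping, Optional


def cheapest_sufficient_tier(
    sufficient: Mapping[str, bool],
    cost_per_call: Mapping[str, float],
) -> Optional[str]:
    """Return the lowest-cost tier flagged as sufficient, or None if no tier is.

    `sufficient` and `cost_per_call` are hypothetical inputs keyed by tier name.
    """
    candidates = [tier for tier, ok in sufficient.items() if ok]
    if not candidates:
        return None
    return min(candidates, key=lambda tier: cost_per_call[tier])


def cost_reduction(routed_cost: float, baseline_cost: float) -> float:
    """Fraction of baseline spend saved by routing (0.531 means 53.1%)."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return 1.0 - routed_cost / baseline_cost


# Hypothetical example: the small tier fails on this prefix, the other two succeed.
example_sufficient = {"small": False, "medium": True, "large": True}
example_cost = {"small": 0.1, "medium": 0.5, "large": 2.0}
example_label = cheapest_sufficient_tier(example_sufficient, example_cost)
```

Under these assumed inputs the label is the medium tier, since it is the least expensive tier that still succeeds. A routed spend of 46.9 units against a baseline of 100 units would correspond to the 53.1% reduction quoted in the abstract; those unit figures are illustrative only.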
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 9