Keywords: LLM routing, agentic benchmark, dynamic evaluation
TL;DR: An LLM router benchmark for multi-step agentic tasks, with a fast static track and a live dynamic track.
Abstract: Production LLM agents often spend most of their budget across many intermediate calls, yet existing router benchmarks mostly evaluate one-shot prompt routing without accounting for multi-step agentic tasks. We introduce TwinRouterBench, a step-level routing benchmark for choosing the cheapest sufficient model tier conditioned on the prefix visible before the next LLM call. The static track contains 970 router-visible prefixes from 520 instances across SWE-bench, BFCL, mtRAG, QMSum, and PinchBench, with execution-verified tier labels and deterministic scoring over label correctness, trajectory membership, and token cost. The dynamic track runs routers end-to-end on SWE-bench Verified and reports official resolution and realized API spend. On a 100-case held-out SWE-bench split, a logistic router trained on the static labels achieves resolution comparable to unrouted Opus 4.6 while reducing API cost by 53.1%.
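To make the "cheapest sufficient tier" idea concrete, the following is a minimal sketch of how a step-level label and a static score could be computed. It is not the benchmark's actual scoring code: the tier names, the `Prefix` fields, and the `score` function are hypothetical, and the trajectory-membership component mentioned in the abstract is omitted because its definition is not given here.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical tier names, ordered from cheapest to most expensive.
TIERS = ["small", "medium", "large"]


@dataclass
class Prefix:
    """One router-visible prefix (illustrative fields, not the real schema)."""
    passed: Dict[str, bool]   # tier -> did the execution check pass with this tier?
    cost: Dict[str, float]    # tier -> token cost of making the next call with this tier


def cheapest_sufficient_tier(p: Prefix) -> str:
    """Return the cheapest tier whose outcome passed verification."""
    for tier in TIERS:
        if p.passed.get(tier, False):
            return tier
    # Assumption: if no tier passes, fall back to the strongest tier.
    return TIERS[-1]


def score(prefixes: List[Prefix], predictions: List[str]) -> Dict[str, float]:
    """Deterministic score: label accuracy and cost reduction vs. always using the top tier."""
    labels = [cheapest_sufficient_tier(p) for p in prefixes]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    routed_cost = sum(p.cost[pred] for p, pred in zip(prefixes, predictions))
    baseline_cost = sum(p.cost[TIERS[-1]] for p in prefixes)
    return {
        "label_accuracy": correct / len(prefixes),
        "cost_reduction": 1.0 - routed_cost / baseline_cost,
    }


if __name__ == "__main__":
    costs = {"small": 1.0, "medium": 3.0, "large": 10.0}
    data = [
        Prefix(passed={"small": True, "medium": True, "large": True}, cost=costs),
        Prefix(passed={"small": False, "medium": True, "large": True}, cost=costs),
    ]
    print(score(data, ["small", "large"]))
```

Under this reading, a reported cost reduction such as 53.1% would correspond to `cost_reduction` computed against the unrouted top-tier baseline, although the dynamic track measures realized API spend rather than per-prefix token estimates.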
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 9