Support Your Local LMs: Redistributing LM Traffic from Cloud to Edge with TrafficBench

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: model routing, local-cloud compute, inference workloads, inference-time compute
TL;DR: TrafficBench benchmark (1M queries) shows 80.7% of LLM workloads can run on small local models (<20B), with routers achieving 77.1% energy, 67.1% compute, and 60.2% cost savings vs cloud-only deployment.
Abstract: The vast majority of large language model (LLM) queries today are processed by frontier models in centralized cloud infrastructure. However, recent advances have produced small language models (≤20B parameters) that match or exceed larger models on many tasks while offering superior energy and cost efficiency. To better understand what fraction of inference workloads can be shifted from cloud to local compute, we present TrafficBench, a comprehensive benchmark for evaluating query routing between local and cloud-deployed LLMs. TrafficBench comprises 1M real-world queries derived from ChatGPT user conversations and naturalistic reasoning queries, with evaluations across 10 state-of-the-art (SOTA) models, 4 hardware accelerators, and 8 performance metrics. Using TrafficBench, we address three critical questions: (1) what fraction of current inference queries can be handled by small LMs on local accelerators, (2) how effectively can modern routing architectures identify these queries, and (3) what are the downstream efficiency implications of local routing? Our analysis reveals that 80.7% of TrafficBench queries can be successfully handled by small local models, with coverage varying by domain: it exceeds 90% for creative tasks but falls just below 68% for technical fields. We start by evaluating existing SOTA embedding- and decoder-based routing approaches, finding that they do not push the Pareto frontier beyond individual local models. To enable better routing, we introduce a novel binary variation of decoder-based routing that achieves superior performance (F1 = 0.851) when large training datasets (>100K) are available; we also show that embedding models excel in data-constrained settings (<10K). When deployed over real-world traffic distributions, our decoder-based router reduces energy by 77.1%, compute by 67.1%, and cost by 60.2% versus cloud-only deployment, while maintaining comparable task accuracy.
Our longitudinal analysis from 2023-2025 shows a 9.5× improvement in intelligence efficiency (accuracy per watt), with the fraction of locally-serviceable queries increasing from 23.2% to 80.7%, suggesting significant efficiency gains from better routing systems. We release TrafficBench along with a hardware-agnostic profiling harness for measuring model efficiency metrics (e.g., energy utilization), enabling reproducible benchmarking and supporting new research as models and accelerators emerge.
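The binary routing setup described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration: the router scores each query for difficulty and sends it to a small local model unless the score crosses a threshold, in which case it escalates to the cloud. The scoring heuristic here (query length plus keyword cues) is a stand-in assumption, not the paper's decoder-based router.

```python
from dataclasses import dataclass

@dataclass
class RoutingDecision:
    query: str
    target: str   # "local" or "cloud"
    score: float  # estimated difficulty of the query

# Hypothetical cue words signalling technically demanding queries.
TECHNICAL_CUES = ("prove", "derive", "compile", "theorem", "debug")

def difficulty_score(query: str) -> float:
    """Toy stand-in for a learned binary router: higher means harder."""
    length_term = min(len(query.split()) / 200.0, 0.5)
    cue_term = 0.5 if any(c in query.lower() for c in TECHNICAL_CUES) else 0.0
    return length_term + cue_term

def route(query: str, threshold: float = 0.5) -> RoutingDecision:
    """Binary decision: serve locally unless the score exceeds the threshold."""
    s = difficulty_score(query)
    return RoutingDecision(query, "cloud" if s > threshold else "local", s)

if __name__ == "__main__":
    for q in [
        "Write a short poem about autumn.",
        "Prove that the sum of two even numbers is even, then debug this proof.",
    ]:
        d = route(q)
        print(f"{d.target:5s}  score={d.score:.2f}  {q[:40]}")
```

In a real deployment, `difficulty_score` would be replaced by the learned router (embedding- or decoder-based), and the threshold would be tuned on held-out traffic to trade accuracy against the energy, compute, and cost savings reported above.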
Primary Area: datasets and benchmarks
Submission Number: 5453