Routing Beats Latency

TMLR Paper9611 Authors

09 Jun 2026 (modified: 15 Jun 2026) | Under review for TMLR | CC BY 4.0

Abstract: Deploying deep neural networks under hard latency constraints forces a trade-off between accuracy and speed, especially when input difficulty varies widely in practice. High-capacity Transformers such as DeiT-S-16 are robust, but their inference cost is prohibitive for edge deployment. Lightweight CNNs such as SqueezeNet achieve high throughput but lose accuracy under severe blur and noise. Our phase-based study indicates that heuristic complexity measures, namely an entropy-based difficulty predictor ($R^2 = 0.112$) and complexity scores on CIFAR-10 derived from ICNet-9600, are not useful routing cues because of their low variance and poor predictive structure. To address these shortcomings, we propose an Adaptive Mixture of Experts (MoE) framework that performs train-dense, route-sparse inference with a lightweight SqueezeNet-Light Gater. The Gater operates on inputs downsampled to $96 \times 96$ and uses a Gumbel-Softmax straight-through estimator, adding only 0.82 ms of overhead; a latency-aware composite loss with a trade-off parameter $\lambda$ lets the system jointly optimise accuracy and predicted inference cost. Experiments on the Intel Image Classification dataset (augmented with controlled Gaussian blur and noise) show that the system traces a dynamic Pareto frontier. With $\lambda=0.7$, the router sends 98\% of clean images to the fastest expert and routes 90\% of images to DeiT under harsh corruption, reflecting human-interpretable patterns of difficulty. It reaches 82.71\% accuracy at an average latency of 5.39 ms, roughly 29\% lower than that of DeiT-S-16 (7.56 ms), with only a small drop in robustness across corruption levels. Our results show that latency-aware differentiable routing is more efficient and more adaptive than either static CNNs or Transformers on the accuracy-latency Pareto frontier.
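The routing mechanism summarised in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The function names are hypothetical, and the additive loss form (cross-entropy plus $\lambda$ times expected latency) is an assumption, since the abstract does not state the exact composite loss. Only the forward pass of the Gumbel-Softmax sample is shown; in training, the straight-through estimator would pass gradients through the soft probabilities. The last helper reproduces the latency arithmetic reported in the abstract.

```python
import math
import random


def softmax(logits, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    m = max(logits)
    exps = [math.exp((x - m) / tau) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_hard(logits, tau=1.0, rng=random):
    """Forward pass of a straight-through Gumbel-Softmax sample.

    Returns (hard, soft): a one-hot expert choice and the relaxed
    probabilities that would carry gradients during training.
    """
    noisy = []
    for x in logits:
        # Clamp u away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append(x - math.log(-math.log(u)))  # add Gumbel(0, 1) noise
    soft = softmax(noisy, tau)
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return hard, soft


def composite_loss(class_probs, label, gate_probs, expert_latency_ms, lam):
    """Assumed form: cross-entropy + lam * expected latency (in ms)."""
    ce = -math.log(class_probs[label])
    expected_latency = sum(p * t for p, t in zip(gate_probs, expert_latency_ms))
    return ce + lam * expected_latency, expected_latency


def latency_reduction(routed_ms, baseline_ms):
    """Fractional latency saved relative to a baseline."""
    return 1.0 - routed_ms / baseline_ms


if __name__ == "__main__":
    random.seed(0)
    # Two experts: a fast CNN (2.0 ms is a placeholder, not a reported figure)
    # and DeiT-S-16 (7.56 ms, as reported in the abstract).
    hard, soft = gumbel_softmax_hard([2.0, 0.5], tau=1.0)
    loss, lat = composite_loss([0.9, 0.1], 0, soft, [2.0, 7.56], lam=0.7)
    print(hard, round(lat, 2), round(loss, 2))
    print(round(latency_reduction(5.39, 7.56), 3))
```

Under this assumed loss, a larger $\lambda$ penalises gate probability placed on the slower expert more heavily, which matches the abstract's description of trading accuracy against predicted inference cost.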

Submission Type: Long submission (more than 12 pages of main content)

Assigned Action Editor: ~Wei_Liu3

Submission Number: 9611
