Bring Future Vision: Dynamic Computation Allocation Guided by Lightweight Feature Forecaster

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Large Language Models, Computational Efficiency, Dynamic Computation Allocation, Informed Routing
TL;DR: This paper proposes "Informed Routing" to predict a token's recoverability before making routing decisions for dynamic computation allocation.
Abstract: The deployment of large language models (LLMs) in practical scenarios is hindered by their massive computational overhead. While token-wise computation allocation has emerged as a promising solution, existing methods suffer from irreversible information loss and suboptimal token selection due to the $\textit{greedy routing}$ paradigm. This paper introduces a novel paradigm, $\textit{informed routing}$, which proactively addresses these limitations. Our key insight is to employ Lightweight Feature Forecasters (LFF), simple, low-cost networks that learn to approximate the transformations of individual model components, before making any routing decisions. This allows the router to assess a token's recoverability (i.e., how easily its transformation can be approximated) rather than only its immediate importance. Extensive experiments demonstrate that our approach achieves state-of-the-art performance across various sparsity levels on language modeling and reasoning tasks. Notably, even without final LoRA fine-tuning, our method matches or surpasses strong baselines that require full fine-tuning, while reducing training time by over 50\%.
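To make the informed-routing idea concrete, the following is a minimal, hypothetical sketch (not the paper's implementation): a cheap "forecaster" approximates an expensive block's output for every token, a recoverability score selects the tokens that are hardest to approximate, and only those tokens receive full computation. The weight matrices, the `tanh` block, and the oracle error used as the recoverability score are all illustrative stand-ins; in the actual method, LFFs are trained networks and the router predicts recoverability without running the full block.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 6                                      # hidden size, number of tokens

W_full = rng.normal(size=(d, d)) / np.sqrt(d)    # stand-in for an expensive model component
W_lff = W_full + 0.05 * rng.normal(size=(d, d))  # stand-in lightweight feature forecaster

def informed_route(x, keep_ratio=0.5):
    """Run the full block only on the least-recoverable tokens;
    use the cheap LFF forecast for the rest (illustrative sketch)."""
    forecast = np.tanh(x @ W_lff)                # cheap approximation for every token
    full = np.tanh(x @ W_full)                   # oracle, for illustration only:
    # recoverability proxy = forecast error; the paper instead trains a router
    err = np.linalg.norm(full - forecast, axis=-1)
    k = max(1, int(keep_ratio * len(x)))
    hard = np.argsort(err)[-k:]                  # least-recoverable tokens
    out = forecast.copy()
    out[hard] = full[hard]                       # full compute only where needed
    return out, hard

x = rng.normal(size=(n, d))
out, hard = informed_route(x)
```

With `keep_ratio=0.5`, half of the tokens keep the full transformation while the other half are served by the forecaster, which is the intended trade-off: compute is spent where approximation would lose the most information.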
Primary Area: foundation or frontier models, including LLMs
Submission Number: 16918