ResLR: Residual-Low-Rank Surrogates for Stable and Fast Context Adaptive Computing in Large Language Models
Keywords: Residual-Low-Rank, context adaptive computing, Block-Wise Multi-Path Routing, dynamic inference, self-distillation
TL;DR: ResLR is an adaptive computing framework using low-rank surrogates and block-wise routing. It preserves the model's functional hierarchy, cuts inference FLOPs by 48%–52% for a ~1.9× speedup, improves routing stability, and achieves state-of-the-art task performance.
Abstract: Large Language Models (LLMs) achieve state-of-the-art results on diverse tasks, yet inference remains expensive because every token traverses the full Transformer stack. Recent context adaptive computing methods mitigate this cost through token-wise layer skipping, but their per-layer routing is volatile, leading to accuracy oscillations and extended fine-tuning. We trace this instability to two issues: (i) direct skips violate the model's functional hierarchy, and (ii) per-layer routing fails to exploit the similarity of activations between neighboring layers. We therefore propose a unified acceleration framework addressing both problems. First, we introduce the Residual-Low-Rank (ResLR) surrogate, a lightweight bypass that distills the residual transformation between consecutive layers into a low-rank operator within a compact subspace, synthesizing the effect of the skipped layers while preserving the hierarchy. Second, we devise Block-Wise Multi-Path Routing, which clusters neighboring layers into blocks and issues a single routing decision per block, explicitly leveraging activation similarity to stabilize computation and reduce gating overhead. The method integrates into standard LoRA fine-tuning without extra stages. Across question answering, mathematical reasoning, and commonsense inference benchmarks, it reduces FLOPs by 48%–52% and yields ~1.9× wall-time speedups while outperforming static and dynamic baselines. Feature probing suggests ~90% functional preservation; variance analysis shows 42.3% lower score standard deviation and 53.7% more stable routing than layer-skipping approaches, establishing ResLR and block-wise routing as a robust approach to practical, low-cost LLM inference.
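To make the two mechanisms in the abstract concrete, here is a minimal NumPy sketch. It is an illustrative reconstruction, not the authors' implementation: the class name `ResLRSurrogate`, the rank `r`, the gate threshold, and the stand-in `full_block` are all hypothetical. The surrogate models a skipped block as a low-rank residual update `y = x + (x @ A) @ B`, where `A` (d×r) and `B` (r×d) would, per the paper, be distilled from the block's residual transformation; the router then makes one decision per block of layers rather than per layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8  # hidden size and surrogate rank (illustrative values)

class ResLRSurrogate:
    """Hypothetical sketch of a Residual-Low-Rank bypass: a low-rank
    residual update standing in for a block of skipped layers."""
    def __init__(self, d, r):
        # In the paper these factors are distilled from the block's
        # residual transformation; here they are random placeholders.
        self.A = rng.normal(scale=0.02, size=(d, r))
        self.B = rng.normal(scale=0.02, size=(r, d))

    def __call__(self, x):
        # y = x + (x A) B : cost O(d*r) per token instead of the full block
        return x + (x @ self.A) @ self.B

def route_block(x, full_block, surrogate, gate_score, threshold=0.5):
    """Block-wise routing: a single gate decision covers the whole block.
    Run the full layers when the gate fires, else take the low-rank bypass."""
    return full_block(x) if gate_score >= threshold else surrogate(x)

x = rng.normal(size=(4, d))             # a batch of token activations
surrogate = ResLRSurrogate(d, r)
full_block = lambda h: h + np.tanh(h)   # stand-in for the real Transformer block
y = route_block(x, full_block, surrogate, gate_score=0.2)  # gate below threshold -> bypass
print(y.shape)
```

Because the surrogate is a residual update around the identity, a poorly distilled `A`/`B` degrades toward a plain skip rather than corrupting the activations, which is consistent with the stability argument the abstract makes for preserving the functional hierarchy.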
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 6409