ResLR: Residual-Low-Rank Surrogates for Stable and Fast Context Adaptive Computing in Large Language Models
Keywords: Residual-Low-Rank, context adaptive computing, Block-Wise Multi-Path Routing, dynamic inference, self-distillation
TL;DR: ResLR is an adaptive computing framework using low-rank surrogates and block-wise routing. It preserves the model's functional hierarchy, cuts inference FLOPs by 48%–52% for a ~1.9× speedup, improves routing stability, and achieves state-of-the-art task performance.
Abstract: Large Language Models (LLMs) achieve state-of-the-art results on diverse tasks, yet inference remains expensive because every token traverses the full Transformer stack. Recent context adaptive computing methods mitigate this cost through token-wise layer skipping, but their per-layer routing is volatile, leading to accuracy oscillations and extended fine-tuning. We trace this instability to two issues: (i) direct skips violate the model's functional hierarchy, and (ii) per-layer routing fails to exploit the similarity of activations between neighboring layers. We therefore propose a unified acceleration framework addressing both problems. First, we introduce the Residual-Low-Rank (ResLR) surrogate, a lightweight bypass that distills the residual transformation between consecutive layers into a low-rank operator within a compact subspace, synthesizing the effect of the skipped layers while preserving the hierarchy. Second, we devise Block-Wise Multi-Path Routing, which clusters neighboring layers into blocks and issues a single routing decision per block, explicitly leveraging activation similarity to stabilize computation and reduce gating overhead. The method integrates into standard LoRA fine-tuning without extra stages. Across question answering, mathematical reasoning, and commonsense inference benchmarks, it reduces FLOPs by 48%–52% and yields ~1.9× wall-time speedups while outperforming static and dynamic baselines. Feature probing suggests ~90% functional preservation; variance analysis shows 42.3% lower score standard deviation and 53.7% more stable routing than layer-skipping approaches, establishing ResLR and block-wise routing as a robust approach to practical, low-cost LLM inference.
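To make the two mechanisms in the abstract concrete, here is a minimal NumPy sketch. It is an illustrative reconstruction, not the authors' implementation: the class name `ResLRSurrogate`, the rank `r`, the gate threshold, and the stand-in `full_block` are all hypothetical. The surrogate models a skipped block as a low-rank residual update `y = x + (x @ A) @ B`, where `A` (d×r) and `B` (r×d) would, per the paper, be distilled from the block's residual transformation; the router then makes one decision per block of layers rather than per layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8  # hidden size and surrogate rank (illustrative values)

class ResLRSurrogate:
    """Hypothetical sketch of a Residual-Low-Rank bypass: a low-rank
    residual update standing in for a block of skipped layers."""
    def __init__(self, d, r):
        # In the paper these factors are distilled from the block's
        # residual transformation; here they are random placeholders.
        self.A = rng.normal(scale=0.02, size=(d, r))
        self.B = rng.normal(scale=0.02, size=(r, d))

    def __call__(self, x):
        # y = x + (x A) B : cost O(d*r) per token instead of the full block
        return x + (x @ self.A) @ self.B

def route_block(x, full_block, surrogate, gate_score, threshold=0.5):
    """Block-wise routing: a single gate decision covers the whole block.
    Run the full layers when the gate fires, else take the low-rank bypass."""
    return full_block(x) if gate_score >= threshold else surrogate(x)

x = rng.normal(size=(4, d))             # a batch of token activations
surrogate = ResLRSurrogate(d, r)
full_block = lambda h: h + np.tanh(h)   # stand-in for the real Transformer block
y = route_block(x, full_block, surrogate, gate_score=0.2)  # gate below threshold -> bypass
print(y.shape)
```

Because the surrogate is a residual update around the identity, a poorly distilled `A`/`B` degrades toward a plain skip rather than corrupting the activations, which is consistent with the stability argument the abstract makes for preserving the functional hierarchy.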
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 6409