Bridging Transformers and RWKV: Towards Efficient Multimodal Video Understanding

ICLR 2026 Conference Submission2866 Authors

08 Sept 2025 (modified: 08 Oct 2025), CC BY 4.0
Keywords: transformer, RWKV, video understanding, distill
TL;DR: We accelerate multimodal video understanding by converting Transformer weights into RWKV, enabling more efficient inference without sacrificing performance.
Abstract: Transformer-based Multimodal Large Language Models (MLLMs) struggle to process hour-long video inputs due to the quadratic computational complexity of causal self-attention, leading to prohibitively high computational costs during both training and inference. Existing token compression approaches reduce the number of video tokens, but they often suffer from significant information loss and remain inefficient for extremely long sequences. In this work, we propose a hybrid RWKV-Transformer model that distills Transformer layers into linear RNNs by reusing their attention projection weights, guided by a progressive distillation strategy. Without any token reduction, fully replacing the Transformer layers increases throughput by up to nearly $2\times$. Moreover, replacing about 25\% of the standard Transformer layers with RWKV modules improves throughput by 20\% compared to the original Transformer model, while matching its performance on multiple video understanding benchmarks such as Video-MME and MLVU, and even outperforming it on VNBench and LVBench, with average scores of 74.0\% and 46.8\%, respectively.
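The abstract's core idea is to warm-start RWKV time-mixing modules from the attention projection weights of the Transformer layers they replace, and then refine them via distillation. Below is a minimal sketch of that weight-reuse initialization, assuming a hypothetical mapping of query/key/value/output projections onto RWKV's receptance/key/value/output projections; the module classes, parameter names, and mapping here are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class SimpleAttention(nn.Module):
    """Toy stand-in for a Transformer block's causal self-attention projections."""
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)


class RWKVTimeMix(nn.Module):
    """Toy RWKV time-mixing block: receptance/key/value/output projections
    plus a per-channel decay used by the linear recurrence."""
    def __init__(self, dim: int):
        super().__init__()
        self.receptance = nn.Linear(dim, dim, bias=False)
        self.key = nn.Linear(dim, dim, bias=False)
        self.value = nn.Linear(dim, dim, bias=False)
        self.output = nn.Linear(dim, dim, bias=False)
        # Decay has no attention counterpart; it is learned during distillation.
        self.time_decay = nn.Parameter(torch.zeros(dim))


@torch.no_grad()
def init_rwkv_from_attention(attn: SimpleAttention, rwkv: RWKVTimeMix) -> None:
    """Copy attention projection weights into the RWKV module as a warm start.

    Assumed (hypothetical) mapping: q -> receptance, k -> key, v -> value, o -> output.
    """
    rwkv.receptance.weight.copy_(attn.q_proj.weight)
    rwkv.key.weight.copy_(attn.k_proj.weight)
    rwkv.value.weight.copy_(attn.v_proj.weight)
    rwkv.output.weight.copy_(attn.o_proj.weight)


if __name__ == "__main__":
    dim = 64
    attn, rwkv = SimpleAttention(dim), RWKVTimeMix(dim)
    init_rwkv_from_attention(attn, rwkv)  # warm start before progressive distillation
    print(torch.allclose(rwkv.key.weight, attn.k_proj.weight))  # True
```

In a hybrid setup, such an initialization would be applied only to the subset of layers selected for replacement (e.g., roughly 25\% of them), with the remaining Transformer layers kept intact.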
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2866