SPIDER: Multi-Layer Semantic Token Pruning and Adaptive Sub-Layer Skipping in Multimodal Large Language Models

ICLR 2026 Conference Submission 14993 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: efficient mllm, token pruning, layer skipping
Abstract: Multimodal Large Language Models (MLLMs) face significant efficiency challenges that stem from two distinct yet coupled sources: data redundancy and computational redundancy. While most methods address data redundancy by pruning visual tokens at the output of the visual encoder, or computational redundancy in LLM decoders via blockwise importance, the finer-grained inter-layer representation shifts and the distribution differences within the layers themselves remain underexplored. In this work, we comprehensively investigate this dual-level inefficiency. We posit that intermediate-layer tokens from vision encoders should be considered for effective visual token pruning, as semantic focus shifts across layers, with middle-layer tokens capturing more detailed object-centric information that deeper layers may abstract away. Furthermore, we reveal the differential contributions of the attention and FFN sub-layers across distinct LLM decoder layers. Building upon these findings, we propose \textbf{SPIDER}, a training-free framework that integrates multi-layer \underline{\textbf{S}}emantic visual token \underline{\textbf{P}}run\underline{\textbf{I}}ng with an a\underline{\textbf{D}}aptive sub-lay\underline{\textbf{ER}} skipping mechanism. Experimental evaluations demonstrate that SPIDER consistently maintains strong performance across various MLLM architectures and reduction ratios. Notably, SPIDER reduces FLOPs to $20\%$ for LLaVA-Next-7B while preserving $98.7\%$ of the baseline performance.
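The sketch below is a minimal, hedged illustration of the two ideas the abstract describes, not the authors' released implementation. The layer choices, the [CLS]-similarity scoring, the norm-ratio skipping criterion, and all thresholds are assumptions made purely for illustration.

```python
import torch

# --- Multi-layer semantic visual token pruning (illustrative assumption) ---
def prune_visual_tokens(layer_hidden_states, keep_ratio=0.25, layer_ids=(12, 23)):
    """Score patch tokens by their [CLS]-to-patch cosine similarity averaged over
    a middle and a deep vision-encoder layer, then keep the top-k patches.
    Assumes each tensor is (batch, 1 + num_patches, dim) with [CLS] at index 0."""
    scores = 0.0
    for lid in layer_ids:
        h = layer_hidden_states[lid]
        cls_tok, patches = h[:, :1], h[:, 1:]
        scores = scores + torch.cosine_similarity(cls_tok, patches, dim=-1)
    scores = scores / len(layer_ids)

    num_keep = max(1, int(keep_ratio * scores.shape[1]))
    keep_idx = scores.topk(num_keep, dim=1).indices.sort(dim=1).values  # keep spatial order
    patches_last = layer_hidden_states[layer_ids[-1]][:, 1:]
    batch_idx = torch.arange(patches_last.shape[0]).unsqueeze(-1)
    return patches_last[batch_idx, keep_idx]

# --- Adaptive sub-layer skipping in a decoder layer (illustrative assumption) ---
def decoder_layer_with_skipping(x, attn, ffn, norm1, norm2, skip_tau=0.05):
    """Run one pre-norm decoder layer, but drop the attention and/or FFN residual
    branch when its update is small relative to the running hidden state."""
    attn_out = attn(norm1(x))
    if attn_out.norm() / x.norm() > skip_tau:   # keep attention only if it matters
        x = x + attn_out

    ffn_out = ffn(norm2(x))
    if ffn_out.norm() / x.norm() > skip_tau:    # keep FFN only if it matters
        x = x + ffn_out
    return x

if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy vision-encoder states: 1 [CLS] token + 576 patches, hidden size 1024
    hs = {lid: torch.randn(1, 577, 1024) for lid in (12, 23)}
    print(prune_visual_tokens(hs, keep_ratio=0.25).shape)  # torch.Size([1, 144, 1024])

    dim = 1024
    x = torch.randn(1, 160, dim)
    attn = torch.nn.MultiheadAttention(dim, 8, batch_first=True)
    ffn = torch.nn.Sequential(torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(),
                              torch.nn.Linear(4 * dim, dim))
    norm1, norm2 = torch.nn.LayerNorm(dim), torch.nn.LayerNorm(dim)
    y = decoder_layer_with_skipping(
        x, lambda h: attn(h, h, h, need_weights=False)[0], ffn, norm1, norm2)
    print(y.shape)  # torch.Size([1, 160, 1024])
```

The paper's actual scoring and skipping criteria may differ; this sketch only shows how multi-layer token scores and per-sub-layer gating could be wired together in a training-free way.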
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14993