DTP: Delta-Guided Two-Stage Pruning for Mamba-based Multimodal Large Language Models

ICLR 2026 Conference Submission 17083 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Mamba, Multimodal Large Language Models, Token Pruning, Efficiency, Interpretability
Abstract: Multimodal large language models built on the Mamba architecture offer efficiency advantages, yet remain hampered by redundant visual tokens that inflate inference cost, with the prefill stage accounting for the majority of total inference time. We introduce Delta-guided Two stage Pruning (DTP), a method that progressively reduces token redundancy through selective pruning at early layer and complete pruning at late layer. Unlike Transformer-oriented pruning methods, our approach derives token importance directly from Mamba’s internal parameters. The statistical distribution of these importance scores, combined with implicit attention patterns, then provides the basis for determining both the pruning layers and the tokens to be removed. Extensive evaluation across diverse benchmarks demonstrates that DTP reduces computation by nearly 50\% while preserving task performance more effectively than existing pruning methods under the same reduction setting. Beyond efficiency, our analysis reveals previously underexplored behaviors of visual tokens within Mamba layers, suggesting a principled perspective for designing future pruning techniques in Mamba-based Multimodal Large Language Models.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 17083