SFPruner: Single-Forward Visual Token Subset Selection for Resource-Efficient Multimodal Foundation Model Inference
Keywords: Multimodal Foundation Models, Visual Token Pruning, Resource-Efficient Inference, Token-Level Adaptation, Quality-Resource Trade-offs
TL;DR: SFPruner performs single-forward, redundancy-aware visual token pruning for efficient high-resolution MLLM inference.
Abstract: High-resolution multimodal foundation models allocate substantial inference compute to visual tokens, making visual-token subset selection a central challenge for resource-efficient deployment. Existing pruning methods face a fundamental trade-off: fast heuristic methods introduce little overhead but provide limited redundancy control, while combinatorial subset-optimization methods better preserve diversity but rely on sequential greedy search, which can erode wall-clock gains. We propose SFPruner, a single-forward approximation to redundancy-aware visual token subset selection. Rather than constructing the subset iteratively, SFPruner embeds redundancy control directly into the scoring space through semantics-guided ridge leverage and ranking-based directional masking, enabling one-shot Top-K selection under deployment-specified token budgets. Across image and video MLLMs, SFPruner preserves competitive quality-resource trade-offs while substantially reducing selection overhead, for example from 112.4 ms to 2.5 ms on Qwen2.5-VL at a 512-token budget. These results highlight that, for token pruning to deliver real wall-clock speedups, the selection policy must be both redundancy-aware and computationally lightweight.
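Illustration: the abstract does not spell out the scoring rule, so the following is only a minimal sketch of the generic building block it names, plain ridge leverage scoring followed by one-shot Top-K selection. It is not the authors' implementation: the semantics guidance and the ranking-based directional masking are omitted, and the function names, the regularizer `lam`, and the use of NumPy are assumptions made for illustration. The ridge leverage score of token x_i is x_i^T (X^T X + lam I)^{-1} x_i, which is lower for tokens whose direction is already well covered by other tokens.

```python
import numpy as np

def ridge_leverage_scores(X, lam=1.0):
    """Return l_i = x_i^T (X^T X + lam * I)^{-1} x_i for each row x_i of X.

    X is an (n_tokens, dim) array of visual token features (hypothetical input).
    """
    d = X.shape[1]
    G = X.T @ X + lam * np.eye(d)       # regularized Gram matrix, shape (d, d)
    Z = np.linalg.solve(G, X.T)         # solves G Z = X^T without forming G^{-1}
    return np.einsum("nd,dn->n", X, Z)  # row-wise quadratic forms

def select_top_k(X, k, lam=1.0):
    """One-shot selection: keep the k tokens with the highest scores."""
    scores = ridge_leverage_scores(X, lam)
    k = min(k, X.shape[0])
    idx = np.argpartition(-scores, k - 1)[:k]  # unordered indices of the top k
    return np.sort(idx)                        # restore original token order

# Three identical tokens and one distinct token: the duplicates share
# leverage, so the distinct token scores highest.
X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
scores = ridge_leverage_scores(X, lam=1.0)
kept = select_top_k(X, k=1, lam=1.0)
```

All scores come from a single linear solve and no greedy loop is involved, which is the property the abstract contrasts with sequential subset search.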
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 77