Keywords: Large Multimodal Model, Inference Acceleration, Token Reduction
Abstract: Large Multimodal Models (LMMs) have shown remarkable success in image understanding tasks. LMMs encode visual and textual inputs into tokens, which are then fed into Large Language Models (LLMs). However, the large number of visual tokens poses a major bottleneck for inference efficiency and memory usage. Reducing visual tokens is a promising training-free solution, but existing methods remain limited: importance-based approaches often yield redundant selections, while diversity-based ones overlook differences in informativeness among the tokens themselves. Two-stage hybrid methods inherit the shortcomings of importance-based selection and result in suboptimal choices. To address this, we formulate token reduction as an optimal subset selection problem and identify two key criteria for a good subset, informativeness and coverage, to guide the selection that best preserves LLM output fidelity. Based on these principles, we propose CoIn, a token selection framework that jointly optimizes both. CoIn integrates visual saliency, cross-modal relevance, and representational novelty into a unified scoring function, enabling the selection of a compact yet expressive token subset. It is efficient, model-agnostic, and compatible with modern inference accelerators. Experiments on multiple benchmarks demonstrate that CoIn substantially reduces computation and memory cost while maintaining strong task performance.
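The selection strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the greedy loop, the linear combination of scores, and the weights `alpha`, `beta`, `gamma` are all hypothetical stand-ins for combining visual saliency and cross-modal relevance (informativeness) with representational novelty (coverage).

```python
# Hypothetical sketch of informativeness + coverage token selection.
# Scoring terms and weights are illustrative, not the paper's method.
import numpy as np

def select_tokens(vis, txt, k, alpha=0.5, beta=0.5, gamma=1.0):
    """vis: (N, d) visual token features; txt: (d,) pooled text feature.
    Returns indices of k selected tokens."""
    vis_n = vis / np.linalg.norm(vis, axis=1, keepdims=True)
    txt_n = txt / np.linalg.norm(txt)
    saliency = np.linalg.norm(vis, axis=1)      # visual saliency proxy
    saliency /= saliency.max()
    relevance = vis_n @ txt_n                   # cross-modal relevance (cosine)
    base = alpha * saliency + beta * relevance  # informativeness
    selected = []
    for _ in range(k):
        if selected:
            # novelty: penalize similarity to the closest already-selected token
            sim = vis_n @ vis_n[selected].T     # (N, |S|)
            novelty = 1.0 - sim.max(axis=1)
        else:
            novelty = np.ones(len(vis))
        score = base + gamma * novelty
        score[selected] = -np.inf               # never re-pick a token
        selected.append(int(np.argmax(score)))
    return selected
```

The greedy structure makes the coverage term adaptive: each new pick is rewarded for being dissimilar to everything chosen so far, so the subset stays compact without collapsing onto the highest-saliency region.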
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 15936