CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose CoreMatching, a co-adaptive sparse inference framework, which leverages the synergy between token and neuron sparsity to enhance inference efficiency.
Abstract: Vision-Language Models (VLMs) excel across diverse tasks but suffer from high inference costs in time and memory. Token sparsity mitigates inefficiencies in token usage, while neuron sparsity reduces high-dimensional computations, both offering promising solutions for enhancing efficiency. To date, these two sparsity paradigms have evolved largely in parallel, fostering the prevailing assumption that they function independently. However, a fundamental yet underexplored question remains: do they truly operate in isolation, or is there a deeper underlying interplay that has yet to be uncovered? In this paper, we conduct the first comprehensive investigation into this question. By introducing and analyzing the matching mechanism between Core Neurons and Core Tokens, we find that the neurons and tokens most critical for inference mutually influence and reinforce each other. Building on this insight, we propose CoreMatching, a co-adaptive sparse inference framework that leverages the synergy between token and neuron sparsity to enhance inference efficiency. Through theoretical analysis and efficiency evaluations, we demonstrate that the proposed method surpasses state-of-the-art baselines on ten image-understanding tasks and three hardware devices. Notably, on the NVIDIA Titan Xp, it achieves a 5$\times$ FLOPs reduction and a 10$\times$ overall speedup. Code is released at [https://github.com/wangqinsi1/2025-ICML-CoreMatching/tree/main](https://github.com/wangqinsi1/2025-ICML-CoreMatching/tree/main).
Lay Summary: Modern AI models that understand both images and language are powerful but often slow and resource-hungry. To make them faster, researchers have tried two main strategies: focusing only on the most important pieces of the input (like key words in a sentence or key parts of an image) and reducing the amount of brain-like activity inside the model. So far, these two ideas have mostly been studied separately. In our work, we ask a simple but important question: what if these two strategies actually help each other? We find that the most useful parts of the input and the most important parts of the model tend to match up—and this connection can be used to make the model even more efficient. Based on this insight, we designed a new method called CoreMatching, which smartly selects both the key inputs and the key model components at the same time. This leads to much faster AI with almost no drop in performance. Our approach works well across many vision tasks and devices—on one common graphics card, it runs up to 10 times faster than current methods.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/wangqinsi1/2025-ICML-CoreMatching/tree/main
Primary Area: Deep Learning->Large Language Models
Keywords: Vision-language model, VLM inference acceleration
Submission Number: 5437