Keywords: Vision-language Models
Abstract: Recent advancements in vision-language models have expanded their potential for real-world applications, enabling these models to perform complex reasoning on images.
However, in widely used fully autoregressive pipelines such as LLaVA, where projected visual tokens are prepended to the textual tokens, the visual tokens often number in the hundreds or thousands, far exceeding the length of the input prompt. This large quantity of visual tokens introduces significant computational overhead, slowing down both training and inference.
In this paper, we propose Visual Compact Token Registers (Victor), a method that reduces the number of visual tokens by summarizing them into a smaller set of register tokens. Victor appends a few learnable register tokens after the visual tokens and summarizes the visual information into these registers using the first few layers of the language tower. After these layers, all visual tokens are discarded, significantly improving computational efficiency for both training and inference. Notably, our method is easy to implement, requires only a small number of new trainable parameters, and has minimal impact on model performance.
In our experiments, with only $8$ visual registers (about $1\%$ of the original visual tokens), Victor shows less than a $4\%$ performance drop while reducing total training time by $43\%$ and boosting inference throughput by $3.36\times$.
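To make the mechanism described above concrete, the following is a minimal, hypothetical PyTorch sketch of the register-token idea: learnable registers are concatenated after the visual tokens, the combined sequence passes through the first few transformer layers so the registers can absorb visual information, and the visual tokens are then dropped before the remaining layers. All names, shapes, and the use of bidirectional encoder layers (rather than the causal decoder of an actual language model) are simplifying assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RegisterSummarizer(nn.Module):
    """Toy sketch: summarize visual tokens into a few registers, then discard them."""

    def __init__(self, d_model=512, n_heads=8, n_layers=12,
                 n_registers=8, summarize_layers=3):
        super().__init__()
        # Learnable register tokens (assumed shape: n_registers x d_model).
        self.registers = nn.Parameter(torch.randn(n_registers, d_model) * 0.02)
        make_layer = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # First few layers see visual tokens + registers + text.
        self.early_layers = nn.ModuleList(make_layer() for _ in range(summarize_layers))
        # Remaining layers see only registers + text.
        self.late_layers = nn.ModuleList(make_layer() for _ in range(n_layers - summarize_layers))

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, N_v, d), text_tokens: (B, N_t, d)
        b, n_v, _ = visual_tokens.shape
        regs = self.registers.unsqueeze(0).expand(b, -1, -1)
        # Registers are placed after the visual tokens so they can attend to them.
        x = torch.cat([visual_tokens, regs, text_tokens], dim=1)
        for blk in self.early_layers:
            x = blk(x)
        # Drop the visual tokens; keep only registers + text for the rest of the network.
        x = x[:, n_v:, :]
        for blk in self.late_layers:
            x = blk(x)
        return x

if __name__ == "__main__":
    model = RegisterSummarizer()
    vis = torch.randn(2, 576, 512)   # e.g., hundreds of projected visual tokens
    txt = torch.randn(2, 32, 512)    # a much shorter textual prompt
    out = model(vis, txt)
    print(out.shape)  # (2, 8 + 32, 512): only registers + text remain
```

The efficiency gain in this sketch comes from the later (majority of) layers operating on a sequence of length `n_registers + N_t` instead of `N_v + n_registers + N_t`.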
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3129