Law of Vision Representation in MLLMs

Published: 08 Jul 2025, Last Modified: 26 Aug 2025 · COLM 2025 · CC BY 4.0
Keywords: Multimodal Large Language Models; Computer Vision; Vision Representation
TL;DR: MLLM performance is correlated with the cross-modal alignment and correspondence of its vision representation
Abstract: We introduce the "Law of Vision Representation" in multimodal large language models (MLLMs), revealing a strong correlation among cross-modal alignment, vision representation correspondence, and overall model performance. We quantify these factors using the cross-modal Alignment (A) and Correspondence (C) scores. Extensive experiments across fifteen distinct vision representation settings and evaluations on eight benchmarks show that the A and C scores correlate with performance following a quadratic relationship. By leveraging this relationship, we can identify and train the optimal vision representation for an MLLM, achieving a 99.7% reduction in computational cost without the need for repeated finetuning of the language model.
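The abstract states that performance follows a quadratic relationship in the A and C scores, which can then be used to rank candidate vision representations without refinetuning the language model. A minimal sketch of that idea is a least-squares fit of a second-order polynomial in (A, C); the function names, the exact design matrix, and the synthetic scores below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fit_quadratic_law(A, C, perf):
    """Least-squares fit of perf ~ quadratic in (A, C).

    A, C  : Alignment / Correspondence scores, one per vision
            representation setting (hypothetical inputs).
    perf  : measured benchmark performance for each setting.
    Returns coefficients for the design [1, A, C, A^2, C^2, A*C].
    """
    A, C, perf = map(np.asarray, (A, C, perf))
    X = np.column_stack([np.ones_like(A), A, C, A**2, C**2, A * C])
    coef, *_ = np.linalg.lstsq(X, perf, rcond=None)
    return coef

def predict_performance(coef, A, C):
    """Predict performance for candidate representations from A, C alone."""
    A, C = np.atleast_1d(A), np.atleast_1d(C)
    X = np.column_stack([np.ones_like(A), A, C, A**2, C**2, A * C])
    return X @ coef
```

With such a fit, selecting the best vision representation reduces to computing A and C for each candidate and taking the argmax of the predicted performance, which is where the claimed compute savings would come from.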
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 979