Law of Vision Representation in MLLMs

Published: 08 Jul 2025, Last Modified: 26 Aug 2025 · COLM 2025 · CC BY 4.0
Keywords: Multimodal Large Language Models; Computer Vision; Vision Representation
TL;DR: MLLM performance is correlated with the cross-modal alignment and correspondence of its vision representation
Abstract: We introduce the "Law of Vision Representation" in multimodal large language models (MLLMs), revealing a strong correlation among cross-modal alignment, vision representation correspondence, and overall model performance. We quantify these factors using the cross-modal Alignment (A) and Correspondence (C) scores. Extensive experiments across fifteen distinct vision representation settings and evaluations on eight benchmarks show that the A and C scores correlate with performance following a quadratic relationship. By leveraging this relationship, we can identify and train the optimal vision representation for an MLLM, achieving a 99.7% reduction in computational cost without the need for repeated finetuning of the language model.
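The abstract states that performance follows a quadratic relationship in the A and C scores, which can then be used to rank candidate vision representations without refinetuning the language model. A minimal sketch of that idea is a least-squares fit of a second-order polynomial in (A, C); the function names, the exact design matrix, and the synthetic scores below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fit_quadratic_law(A, C, perf):
    """Least-squares fit of perf ~ quadratic in (A, C).

    A, C  : Alignment / Correspondence scores, one per vision
            representation setting (hypothetical inputs).
    perf  : measured benchmark performance for each setting.
    Returns coefficients for the design [1, A, C, A^2, C^2, A*C].
    """
    A, C, perf = map(np.asarray, (A, C, perf))
    X = np.column_stack([np.ones_like(A), A, C, A**2, C**2, A * C])
    coef, *_ = np.linalg.lstsq(X, perf, rcond=None)
    return coef

def predict_performance(coef, A, C):
    """Predict performance for candidate representations from A, C alone."""
    A, C = np.atleast_1d(A), np.atleast_1d(C)
    X = np.column_stack([np.ones_like(A), A, C, A**2, C**2, A * C])
    return X @ coef
```

With such a fit, selecting the best vision representation reduces to computing A and C for each candidate and taking the argmax of the predicted performance, which is where the claimed compute savings would come from.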
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 979