Understanding Vision and Language Representations under the Lens of Intrinsic Dimension

22 Sept 2023 (modified: 11 Feb 2024), submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: intrinsic dimension, vision and language, cross-modal representation, pruning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Current multimodal representation learning is achieved mainly through intuitive and heuristic approaches, and how the modalities cooperate, and what each contributes, remains unclear. We empirically investigate the intrinsic dimension (ID) of the large-scale vision-language pre-training model BLIP and explore the relationships among intrinsic dimension, modality, and prunability. We show that the geometric ID characteristics of visual and language representations differ significantly in range and shape, resulting in distinct prunability for each modality. Unified multimodal learning manifests as an overlay of the ID variations of vision and language. Investigating the IDs of attention representations reveals that the current cross-modal attention mechanism struggles to embed both modalities into the same low-dimensional manifold because their ID levels differ. Moreover, we study the contribution of each modality to model prunability and explore predicting model performance from the distribution of IDs. We propose an ID-based importance metric that yields superior performance for multimodal model pruning. Experimental results show that visual representations are more sensitive and fragile to pruning, while language representations are more robust and therefore more prunable: 90% of BLIP's language-modality weights can be pruned with a drop of only 3.8 points on the CIDEr metric. Our observations suggest the potential for more effective pruning of multimodal models using intrinsic dimension as a guiding metric.
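The abstract does not specify which ID estimator the paper uses; as a hedged illustration, the sketch below implements the TwoNN estimator (Facco et al., 2017), a standard choice for measuring the intrinsic dimension of deep representations. The function name `twonn_id` and the synthetic data are assumptions for demonstration, not the authors' code.

```python
# Minimal sketch of the TwoNN intrinsic-dimension estimator (Facco et al., 2017).
# Assumption: this is one common way to estimate the ID of layer activations;
# the submission does not state its exact estimator.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(x: np.ndarray) -> float:
    """Estimate the intrinsic dimension of a point cloud x of shape (n_points, n_features)."""
    # Distances to the two nearest neighbours of each point
    # (column 0 of the result is the point itself, at distance 0).
    nn = NearestNeighbors(n_neighbors=3).fit(x)
    dists, _ = nn.kneighbors(x)
    r1, r2 = dists[:, 1], dists[:, 2]
    # Drop points whose nearest neighbour coincides with them, to avoid division by zero.
    valid = r1 > 0
    mu = r2[valid] / r1[valid]
    # Under the TwoNN model, the ratios mu follow a Pareto distribution whose
    # exponent is the intrinsic dimension; this is its maximum-likelihood estimate.
    return len(mu) / float(np.sum(np.log(mu)))

# Toy check: points on a 5-dimensional linear subspace embedded in 768 ambient
# dimensions (standing in for, e.g., BLIP token representations).
rng = np.random.default_rng(0)
reps = rng.standard_normal((2048, 5)) @ rng.standard_normal((5, 768))
print(f"Estimated ID: {twonn_id(reps):.1f}")  # close to 5
```

Applied per layer to visual and language activations, such an estimate yields the ID profiles whose range and shape the abstract compares across modalities.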
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4796