Track: long paper (up to 10 pages)
Domain: machine learning
Abstract: Cross-modal distillation has emerged as a critical technique for leveraging strengths across different modalities. However, existing methods have not demonstrated performance gains transferred between models trained on data from different modalities. In this work, we introduce a cross-modal alignment regularization (CMAR) term into language model training that aligns the language model's representations with those of a vision model at specific layers. Our experiments demonstrate that the method improves language model performance across various downstream tasks in both pre-training and fine-tuning settings. In the pre-training setting, we observe accuracy improvements of 1.01% on the Language Modeling Broadened to Account for Discourse Aspects (LAMBADA) dataset and 1.49% on the Choice of Plausible Alternatives (COPA) dataset. The method is also effective in the fine-tuning setting, boosting accuracy by 1.20% on LAMBADA and 2.00% on COPA, indicating that a vision model can substantially enhance language model performance. CMAR thus offers a simple yet effective strategy for consistently improving language models through direct cross-modal representation alignment with vision models, opening new avenues for enhancing model performance with signals from other modalities.
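A minimal sketch of what such an alignment regularizer could look like is given below. The linear projection, mean pooling over tokens, cosine-similarity objective, choice of a single aligned layer, and the 0.1 loss weight are all illustrative assumptions; the abstract does not specify the exact formulation used in the paper.

```python
# Illustrative sketch of a cross-modal alignment regularization (CMAR) term.
# All design choices here (projection, pooling, cosine objective, loss weight)
# are assumptions for illustration, not the paper's confirmed method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CMARLoss(nn.Module):
    """Aligns language-model hidden states with frozen vision-model features."""

    def __init__(self, lm_dim: int, vision_dim: int):
        super().__init__()
        # Linear projection from the LM hidden size to the vision feature size.
        self.proj = nn.Linear(lm_dim, vision_dim)

    def forward(self, lm_hidden: torch.Tensor, vision_feat: torch.Tensor) -> torch.Tensor:
        # lm_hidden: (batch, seq_len, lm_dim) hidden states from a chosen LM layer.
        # vision_feat: (batch, vision_dim) pooled features from the vision model.
        pooled = lm_hidden.mean(dim=1)      # mean-pool over token positions
        projected = self.proj(pooled)       # map into the vision feature space
        # Penalize misalignment via negative cosine similarity.
        return 1.0 - F.cosine_similarity(projected, vision_feat, dim=-1).mean()


if __name__ == "__main__":
    # Usage: add the alignment term to the standard language-modeling objective.
    cmar = CMARLoss(lm_dim=768, vision_dim=512)
    lm_hidden = torch.randn(4, 128, 768)    # placeholder LM hidden states
    vision_feat = torch.randn(4, 512)       # placeholder vision features
    lm_loss = torch.tensor(2.3)             # placeholder cross-entropy loss
    total_loss = lm_loss + 0.1 * cmar(lm_hidden, vision_feat)  # 0.1 = assumed weight
    print(total_loss.item())
```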
Submission Number: 41