Track: long paper (up to 10 pages)
Domain: machine learning
Abstract: Cross-modal distillation has emerged as a critical technique for leveraging strengths across different modalities. However, existing methods have not demonstrated performance gains transferred between models trained on data from different modalities. In this work, we introduce a cross-modal alignment regularization (CMAR) term into language model training that aligns the language model's representations with those of a vision model at specific layers. Our experiments demonstrate that the method improves language model performance across various downstream tasks in both pre-training and fine-tuning settings. In the pre-training setting, we observe accuracy improvements of 1.01% on the Language Modeling Broadened to Account for Discourse Aspects (LAMBADA) dataset and 1.49% on the Choice of Plausible Alternatives (COPA) dataset. The method is also effective in the fine-tuning setting, boosting accuracy by 1.20% on LAMBADA and 2.00% on COPA, indicating that a vision model can substantially enhance language model performance. CMAR thus offers a simple yet effective strategy for consistently improving language models through direct cross-modal representation alignment with vision models, opening new avenues for enhancing model performance with signals from other modalities.
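A minimal sketch of what such an alignment regularizer could look like is given below. The linear projection, mean pooling over tokens, cosine-similarity objective, choice of a single aligned layer, and the 0.1 loss weight are all illustrative assumptions; the abstract does not specify the exact formulation used in the paper.

```python
# Illustrative sketch of a cross-modal alignment regularization (CMAR) term.
# All design choices here (projection, pooling, cosine objective, loss weight)
# are assumptions for illustration, not the paper's confirmed method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CMARLoss(nn.Module):
    """Aligns language-model hidden states with frozen vision-model features."""

    def __init__(self, lm_dim: int, vision_dim: int):
        super().__init__()
        # Linear projection from the LM hidden size to the vision feature size.
        self.proj = nn.Linear(lm_dim, vision_dim)

    def forward(self, lm_hidden: torch.Tensor, vision_feat: torch.Tensor) -> torch.Tensor:
        # lm_hidden: (batch, seq_len, lm_dim) hidden states from a chosen LM layer.
        # vision_feat: (batch, vision_dim) pooled features from the vision model.
        pooled = lm_hidden.mean(dim=1)      # mean-pool over token positions
        projected = self.proj(pooled)       # map into the vision feature space
        # Penalize misalignment via negative cosine similarity.
        return 1.0 - F.cosine_similarity(projected, vision_feat, dim=-1).mean()


if __name__ == "__main__":
    # Usage: add the alignment term to the standard language-modeling objective.
    cmar = CMARLoss(lm_dim=768, vision_dim=512)
    lm_hidden = torch.randn(4, 128, 768)    # placeholder LM hidden states
    vision_feat = torch.randn(4, 512)       # placeholder vision features
    lm_loss = torch.tensor(2.3)             # placeholder cross-entropy loss
    total_loss = lm_loss + 0.1 * cmar(lm_hidden, vision_feat)  # 0.1 = assumed weight
    print(total_loss.item())
```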
Submission Number: 41