GOAL: Balance Multimodal Learning with Gradient Orthogonalization and Adaptive Leveraging

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: modality imbalance; multimodal; gradient
Abstract: Multimodal learning, which integrates information from multiple sensory modalities, is naturally expected to outperform its single-modal counterparts. However, the heterogeneity of multimodal data often leads to two imbalance problems that impede unimodal representation learning prior to fusion: inconsistent gradient magnitudes across modalities, and opposing gradient directions within a unimodal encoder caused by competing losses. While recent progress has been made by strengthening within-modality representations, we identify cross-modality compatibility as another critical factor for effective feature fusion. Jointly considering these two factors, we propose Gradient Orthogonalization and Adaptive Leveraging (GOAL), a parameter-free gradient modification method. Guided by the principle that imbalanced dependency on a modality is inversely related to its prediction variance, the Adaptive Leveraging (AL) component dynamically re-weights gradient magnitudes, using prediction entropy as a variance estimator. The Gradient Orthogonalization (GO) component then projects conflicting gradients to ensure synergistic updates that yield compatible multimodal features. Extensive experiments across various modalities and frameworks show that GOAL consistently and significantly outperforms existing state-of-the-art methods, providing a plug-and-play module for multimodal optimization. Our code will be made publicly available.
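The abstract's two mechanisms can be illustrated with a minimal NumPy sketch. This is not the paper's implementation (its exact formulas are not given here): the projection follows a common PCGrad-style rule for conflicting gradients, and the entropy-based weighting is one plausible reading in which the less confident (higher-entropy) modality is up-weighted; the function names `project_if_conflicting` and `goal_update` are hypothetical.

```python
import numpy as np

def softmax_entropy(logits):
    """Shannon entropy of softmax predictions, used as a variance proxy."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(-np.sum(p * np.log(p + 1e-12)))

def project_if_conflicting(g, g_ref):
    """If g opposes g_ref (negative dot product), remove the conflicting
    component by projecting g onto the plane orthogonal to g_ref."""
    dot = float(g @ g_ref)
    if dot < 0:
        g = g - dot / float(g_ref @ g_ref + 1e-12) * g_ref
    return g

def goal_update(g_a, g_b, logits_a, logits_b):
    """Hypothetical combined update for a shared encoder: orthogonalize
    conflicting unimodal gradients (GO), then mix them with weights
    derived from prediction entropy (AL)."""
    ga = project_if_conflicting(g_a, g_b)
    gb = project_if_conflicting(g_b, g_a)
    # Assumed weighting: normalize entropies so the higher-entropy
    # (less relied-upon) modality receives the larger weight.
    h_a, h_b = softmax_entropy(logits_a), softmax_entropy(logits_b)
    w_a, w_b = h_a / (h_a + h_b), h_b / (h_a + h_b)
    return w_a * ga + w_b * gb
```

For example, with conflicting gradients `g_a = [1, 0]` and `g_b = [-1, 1]`, the projection leaves a component of `g_a` that no longer opposes `g_b`.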
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6138