Scalable multi-modal representation learning networks

Zihan Fang, Ying Zou, Shiyang Lan, Shide Du, Yanchao Tan, Shiping Wang

Published: 2025, Last Modified: 21 Jan 2026, Artif. Intell. Rev. 2025, CC BY-SA 4.0

Abstract: Multi-modal representation learning is recognized for its comprehensive interpretation across diverse modalities. Although existing approaches have yielded favorable results, they face challenges in preserving high-order information and in generalizing to out-of-sample data. To tackle these issues, we propose a scalable multi-modal representation learning network framework, which aims to learn optimal modality-specific projection matrices that project multi-modal features into a shared representation space. Specifically, weight-guided modality-wise measures and row-sparsity-driven feature-wise measures are combined to achieve adaptive hierarchical feature selection from the original data. Then, within the unified latent representation space, we employ hypergraph embedding to preserve the intricate high-order local geometric structures of the modality-specific high-dimensional spaces. Finally, we propose a proximal operator-inspired network architecture to solve the optimization objective, streamlining the process of auto-weighted feature selection and representation learning. The experimental results highlight the effectiveness and superiority of the proposed method, while online testing on out-of-sample data further demonstrates robust generalization. The code of the proposed method is publicly available at: https://github.com/ZihanFang11/SMMRL.
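
To illustrate how a proximal operator can enforce row sparsity on a projection matrix, here is a minimal NumPy sketch. It assumes the row-sparsity term is the common l2,1 norm, whose proximal operator is row-wise soft-thresholding; the abstract does not state the exact regularizer, and the names `prox_l21`, `W`, and `tau` are hypothetical and not taken from the authors' repository.

```python
import numpy as np

def prox_l21(W, tau):
    """Proximal operator of tau * ||W||_{2,1} (row-wise soft-thresholding).

    Assumption: each row of W corresponds to one input feature, so a row
    shrunk to zero means that feature is dropped.
    """
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Shrink each row toward zero; rows with norm <= tau vanish entirely.
    scale = np.maximum(1.0 - tau / np.maximum(row_norms, 1e-12), 0.0)
    return scale * W

# Toy projection matrix: 3 features projected to 2 latent dimensions.
W = np.array([[3.0, 4.0],   # strong feature (norm 5)
              [0.3, 0.4],   # weak feature (norm 0.5)
              [0.0, 0.0]])  # already inactive
W_sparse = prox_l21(W, tau=1.0)
print(W_sparse)
```

With `tau=1.0`, the first row is scaled by 0.8 and the weak second row is zeroed out, which is the feature-wise selection effect that a row-sparsity penalty is meant to produce. In an unrolled, proximal operator-inspired network, a step of this kind would typically alternate with a gradient-like update, though the exact layer design here is the authors' and is not reproduced in this sketch.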

External IDs: dblp:journals/air/FangZLDTW25