Keywords: vision-language model; incomplete multimodal learning
TL;DR: Adaptive Incomplete Multimodal Learning via LoRA-based Mixture of Experts
Abstract: In real-world multimodal applications, missing modalities frequently arise due to device failures, network instability, or privacy concerns. Recent studies introduce prompt-based fine-tuning to adapt pre-trained models to such incomplete multimodal learning scenarios. However, these methods suffer from two limitations: (1) static prompts are modality-aware but instance-invariant, ignoring instance-specific characteristics; and (2) prompt complexity is coupled with the number of modalities, hindering scalability. Different from existing prompt-based methods, we propose \textbf{M}ixture of LoRA Experts for \textbf{A}daptive \textbf{I}ncomplete Multimodal \textbf{L}earning, named MAIL. Specifically, we design a LoRA-based Mixture of Experts and insert it into the pre-trained model to achieve adaptive incomplete multimodal learning. By training on datasets with randomly missing modalities, MAIL adaptively selects a fixed-size combination of LoRA experts based on the current modality missingness and the unique characteristics of each instance. Accordingly, the parameter complexity depends only on a hyperparameter controlling the total number of experts, effectively decoupling it from the number of modalities. Extensive comparisons with 11 baselines on three real-world datasets demonstrate that MAIL effectively handles incomplete-modality problems.
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16029
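Below is a minimal sketch of the kind of LoRA-based Mixture-of-Experts layer the abstract describes: a router conditioned on instance features and a modality-missingness mask sparsely selects top-k LoRA experts on top of a frozen pre-trained linear layer, so trainable parameters scale with the number of experts rather than the number of modalities. All class names, hyperparameters (rank, number of experts, top-k), and the exact routing input are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only; the routing input (features + missingness mask)
# and all hyperparameters are assumptions, not MAIL's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One low-rank adapter computing (alpha / rank) * x @ A^T @ B^T."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(out_dim, rank))        # up-projection, zero-init
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return (x @ self.A.T @ self.B.T) * self.scale


class LoRAMoELayer(nn.Module):
    """Frozen pre-trained linear layer plus a routed mixture of LoRA experts.

    The router sees the instance representation concatenated with a
    modality-missingness mask, so expert selection can adapt to both
    instance-specific features and which modalities are absent.
    """

    def __init__(self, base: nn.Linear, num_experts: int = 4, top_k: int = 2,
                 num_modalities: int = 2, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep pre-trained weights frozen
            p.requires_grad = False
        self.experts = nn.ModuleList(
            LoRAExpert(base.in_features, base.out_features, rank)
            for _ in range(num_experts)
        )
        self.router = nn.Linear(base.in_features + num_modalities, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor, missing_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim); missing_mask: (batch, num_modalities), 1.0 = present.
        logits = self.router(torch.cat([x, missing_mask], dim=-1))
        weights, idx = logits.topk(self.top_k, dim=-1)   # sparse expert selection
        weights = F.softmax(weights, dim=-1)             # normalize over chosen experts
        out = self.base(x)
        delta = torch.zeros_like(out)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                sel = idx[:, k] == e                     # rows routed to expert e in slot k
                if sel.any():
                    delta[sel] += weights[sel, k:k + 1] * expert(x[sel])
        return out + delta


if __name__ == "__main__":
    layer = LoRAMoELayer(nn.Linear(512, 512), num_experts=4, top_k=2, num_modalities=2)
    x = torch.randn(8, 512)
    mask = torch.tensor([[1.0, 0.0]] * 8)  # e.g., image present, text missing
    y = layer(x, mask)                     # -> (8, 512)
```

Since only the experts and the router are trainable, the added parameter count is governed by the number of experts and the LoRA rank, which is consistent with the abstract's claim that complexity is decoupled from the number of modalities.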