Keywords: Multimodal Large Language Models, Parameter-Efficient Fine-Tuning, Low-Rank Adaptation (LoRA), Mixture of Experts, Modality-Aware Learning, Multimodal Question Answering
Abstract: Multimodal large language models (MLLMs) face challenges in adapting efficiently to diverse input types, such as text and images, because heterogeneous modalities are difficult to process with a uniform approach. Traditional parameter-efficient fine-tuning (PEFT) methods, such as LoRA, treat all modalities equally, overlooking the need for modality-specific processing. To address this, we propose MAMoE-LoRA, a modality-aware framework that enhances expert specialization through a mixture-of-experts (MoE) architecture. Our approach organizes experts into three distinct pools: modality-specific experts for each input type, modality-shared experts for cross-modal integration, and always-active experts for consistent, domain-agnostic adaptation. We introduce an enhanced gating mechanism that uses causal-aware features and modality embeddings to route tokens to the most suitable experts. We further apply similarity regularization to maintain expert diversity and prevent overfitting. Experiments across multiple multimodal benchmarks show that MAMoE-LoRA achieves strong performance with minimal parameter overhead, requiring only 1.83–2.53% trainable parameters while outperforming existing PEFT methods.
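The routing idea in the abstract — gating tokens over a modality-specific pool plus a shared pool of LoRA experts, with always-active experts applied unconditionally — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the pool sizes, the gating features (token concatenated with a modality embedding), and all names (`lora_expert`, `route`, `sim_penalty`) are assumptions, and the "causal-aware features" of the actual method are simplified away.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4  # hidden size and LoRA rank (illustrative values)

# Hypothetical learned modality embeddings used as extra gating features.
mod_emb = {"text": rng.normal(size=d), "image": rng.normal(size=d)}

def lora_expert():
    # Each expert is a low-rank pair (A, B); B starts at zero, so the
    # adapter initially contributes nothing (standard LoRA initialization).
    return rng.normal(scale=0.02, size=(d, r)), np.zeros((r, d))

# Three expert pools, as described in the abstract (sizes are assumptions).
pools = {"text":   [lora_expert() for _ in range(2)],
         "image":  [lora_expert() for _ in range(2)],
         "shared": [lora_expert() for _ in range(2)],
         "always": [lora_expert()]}

# One gate per modality over its own pool + the shared pool (4 candidates).
gates = {m: rng.normal(scale=0.02, size=(2 * d, 4)) for m in ("text", "image")}

def route(x, modality):
    """Mix LoRA expert outputs for one token of the given modality."""
    candidates = pools[modality] + pools["shared"]
    # Gating features: the token concatenated with its modality embedding.
    feats = np.concatenate([x, mod_emb[modality]])
    logits = feats @ gates[modality]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Weighted sum of gated experts, plus always-active experts.
    delta = sum(p * (x @ A @ B) for p, (A, B) in zip(probs, candidates))
    delta = delta + sum(x @ A @ B for A, B in pools["always"])
    return x + delta

def sim_penalty(experts):
    """Illustrative similarity regularizer: penalize pairwise cosine
    similarity between the experts' A matrices to keep them diverse."""
    vs = [A.ravel() / np.linalg.norm(A) for A, _ in experts]
    return sum(abs(a @ b) for i, a in enumerate(vs) for b in vs[i + 1:])

x = rng.normal(size=d)
y = route(x, "image")  # at initialization B = 0, so y equals x
```

Because every `B` matrix is zero at initialization, the routed output equals the input; training would update the `A`/`B` pairs and the gate weights, with `sim_penalty` added to the loss for each pool.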
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM Efficiency, parameter-efficient-training, multimodality
Contribution Types: Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 4872