Scalable Multimodal Fine-tuning for Foundation Models via Mixture-of-LoRA

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · License: CC BY 4.0
Keywords: Foundation Models, Parameter-Efficient Fine-Tuning, Low-Rank Adaptation, Multimodal Learning, Large Language Models
TL;DR: We propose Mixture-of-LoRA, a novel parameter-efficient fine-tuning setup for transformer-based foundation models.
Abstract: Adapting pre-trained Large Language Models (LLMs) for multimodal tasks presents a significant challenge, often hindered by the prohibitive computational cost of full fine-tuning. In this work, we introduce Mixture-of-LoRA (MoL), a novel and parameter-efficient fine-tuning framework that enables LLMs to seamlessly process and integrate multimodal inputs. MoL combines the efficiency of Low-Rank Adaptation (LoRA) with the modality-specialized design of Mixture-of-Transformers (MoT). Our approach injects small, trainable, modality-specific LoRA adapters into the frozen layers of a pre-trained LLM. While each modality's tokens are processed by these dedicated adapters to learn specialized features, the global self-attention mechanism remains intact, allowing for rich cross-modal fusion within the original LLM architecture. This design efficiently adapts the model to understand diverse data types, such as text, images, and speech, while retaining and leveraging the vast knowledge of the foundation model. Through extensive experiments, we demonstrate that MoL effectively enables pre-trained foundation models to understand and generate multimodal tokens. Our work provides an effective and scalable solution for building multimodal systems from existing unimodal foundation models.
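
To make the architectural idea concrete, here is a minimal PyTorch sketch of a linear projection with per-modality LoRA adapters over a frozen base weight, as the abstract describes. It is an illustration under our own assumptions, not the paper's implementation: the names `MoLLinear`, `modality_ids`, and `num_modalities` are hypothetical, and details such as rank, scaling, and initialization follow common LoRA conventions rather than anything stated in the submission.

```python
# Minimal sketch of a Mixture-of-LoRA linear layer (assumed design, not the
# paper's code): a frozen pre-trained weight shared by all tokens, plus one
# trainable low-rank adapter per modality, selected by a per-token modality id.
import torch
import torch.nn as nn


class MoLLinear(nn.Module):
    """Frozen base linear layer with one LoRA adapter per modality.

    Each token is adapted by the LoRA pair matching its modality id, while
    the shared frozen weight is applied to every token. Self-attention built
    on top of such projections still mixes tokens across modalities.
    """

    def __init__(self, in_features, out_features, num_modalities,
                 rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # keep pre-trained weights frozen
        self.scaling = alpha / rank
        # One low-rank (A, B) pair per modality; B starts at zero so the
        # adapted layer initially matches the pre-trained one.
        self.lora_A = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, in_features) * 0.01)
             for _ in range(num_modalities)]
        )
        self.lora_B = nn.ParameterList(
            [nn.Parameter(torch.zeros(out_features, rank))
             for _ in range(num_modalities)]
        )

    def forward(self, x, modality_ids):
        # x: (batch, seq, in_features); modality_ids: (batch, seq) int tensor
        out = self.base(x)
        for m in range(len(self.lora_A)):
            mask = (modality_ids == m).unsqueeze(-1)       # (batch, seq, 1)
            delta = (x @ self.lora_A[m].T) @ self.lora_B[m].T
            out = out + self.scaling * mask * delta        # route by modality
        return out
```

For readability, this sketch applies every adapter to all tokens and masks the results; an efficient implementation would instead gather each modality's tokens before its adapter. Either way, only the LoRA parameters receive gradients, which is what keeps the approach cheap relative to full fine-tuning.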
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 5561