ORFLEX: Orthogonal Reparameterization with Flexibility for Multimodal Large Language Model Fine-Tuning
Keywords: MLLM, PEFT, low rank adaptation
TL;DR: We propose a PEFT method for MLLMs that enforces orthogonality while retaining flexibility in different modality matrix subspaces, achieving state-of-the-art performance across multimodal tasks.
Abstract: Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key strategy for adapting pretrained large models with minimal trainable parameters. While most methods were developed for LLMs and later extended to multimodal domains, their direct application to multimodal large language models (MLLMs) often overlooks modality-specific discrepancies. In particular, although visual tokens are aligned with language tokens in feature space, differences persist during forward propagation, which existing LoRA-based approaches fail to address. In this work, we propose ORFLEX, a reparameterized PEFT method tailored for MLLMs. First, we observe that the LoRA column spaces associated with visual and text tokens tend to be strongly orthogonal when the parameters are decoupled. We then leverage this property by introducing modality-specific reparameterization branches and designing a QR-inspired decomposition of the LoRA matrix into a frozen orthogonal basis $\hat{Q}$ and a lightweight learnable matrix $\hat{R}$. In addition, we incorporate learnable Householder transformations that adaptively rotate $\hat{Q}$ while preserving orthogonality, enhancing expressiveness. Extensive experiments demonstrate that our approach consistently outperforms strong baselines on both general and domain-specific multimodal benchmarks, underscoring the effectiveness of modality-aware reparameterization in advancing PEFT for MLLMs.
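The QR-inspired split and Householder rotation described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the matrix shapes, variable names, and the use of a single Householder reflection are assumptions for demonstration. It shows that rotating the frozen basis $\hat{Q}$ by a Householder transform keeps its columns orthonormal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a rank-r LoRA factor for a d-dimensional layer.
d, r = 64, 8
A = rng.standard_normal((d, r))  # stand-in for a LoRA matrix

# QR-inspired split: frozen orthonormal basis Q_hat, lightweight R_hat.
Q_hat, R_hat = np.linalg.qr(A)   # Q_hat: (d, r), R_hat: (r, r)

# Householder reflection H = I - 2 v v^T / (v^T v); v would be learnable.
v = rng.standard_normal((d, 1))
H = np.eye(d) - 2.0 * (v @ v.T) / (v.T @ v)

# Rotating the basis preserves orthonormality: (H Q)^T (H Q) = Q^T Q = I.
Q_rot = H @ Q_hat
err = np.linalg.norm(Q_rot.T @ Q_rot - np.eye(r))
print(err < 1e-8)
```

Because $H$ is orthogonal, only the reflection vector $v$ (and the small matrix $\hat{R}$) would need to be trained, which is what makes the reparameterization lightweight.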
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10252