Keywords: Multimodal, MLLM, VLM, Fine-Tuning, Post-Training
TL;DR: MARS addresses imbalanced training dynamics in MLLM fine-tuning by finding the optimal LoRA rank pair, using our proposed dual scaling laws to guide the search.
Abstract: Fine-tuning Multimodal Large Language Models (MLLMs) with parameter-efficient methods like Low-Rank Adaptation (LoRA) is crucial for task adaptation. However, imbalanced training dynamics across modalities often lead to suboptimal accuracy due to negative interference, a challenge typically addressed with inefficient, heuristic methods like manually tuning separate learning rates.
To overcome this, we introduce **MARS** (**M**ultimodal **A**daptive **R**ank **S**earch), an approach to discover optimal rank pairs that balance training dynamics while maximizing performance.
Our key innovation, a framework of dual scaling laws, enables this search: one law models module-specific convergence time to prune the search space to candidates with aligned dynamics, while the other predicts final task performance to select the optimal pair from the pruned set.
By re-purposing LoRA rank as a controller for modality-specific convergence speed, MARS achieves superior performance over baseline methods and offers a robust, automated strategy for optimizing MLLM fine-tuning.
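The following is a minimal, hypothetical sketch of the two-stage search described above, assuming a rank grid, power-law forms for both scaling laws, a mismatch tolerance `TAU`, and pilot-run helpers (`pilot_time`, `pilot_score`) that are illustrative stand-ins rather than the paper's actual method or API.

```python
import numpy as np
from itertools import product

RANKS = [4, 8, 16, 32, 64]   # candidate LoRA ranks for each modality's modules (assumed grid)
TAU = 0.10                   # tolerated relative mismatch in predicted convergence time (assumed)

def fit_power_law(xs, ys):
    """Fit y ~ a * x**b in log-log space and return a predictor x -> y."""
    b, log_a = np.polyfit(np.log(xs), np.log(ys), 1)
    return lambda x: float(np.exp(log_a) * x ** b)

def mars_rank_search(pilot_time, pilot_score):
    """pilot_time(modality, rank) and pilot_score(rank_v, rank_l) come from short pilot runs."""
    # Law 1: fit a module-specific convergence-time scaling law for each modality.
    conv = {
        m: fit_power_law(RANKS, [pilot_time(m, r) for r in RANKS])
        for m in ("vision", "language")
    }

    # Prune the search space to rank pairs whose predicted convergence times are aligned.
    balanced = [
        (rv, rl)
        for rv, rl in product(RANKS, repeat=2)
        if abs(conv["vision"](rv) - conv["language"](rl))
        <= TAU * max(conv["vision"](rv), conv["language"](rl))
    ]

    # Law 2: predict final task performance for the surviving pairs and pick the best one.
    # Here a simple power-law fit in total adapted rank stands in for the paper's law.
    scores = [pilot_score(rv, rl) for rv, rl in balanced]
    perf = fit_power_law([rv + rl for rv, rl in balanced], scores)
    return max(balanced, key=lambda pair: perf(sum(pair)))
```

This is only a sketch of the search structure (prune by convergence alignment, then select by predicted performance); the actual functional forms and fitting procedure are those proposed in the paper.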
Primary Area: optimization
Submission Number: 2909