Keywords: Federated Learning, Multimodal Large Language Models
Abstract: As Multimodal Large Language Models (MLLMs) continue to be trained, the availability of public data diminishes, limiting opportunities for further training and adaptation. However, private data remains an underutilized yet valuable resource. Federated Learning (FL) enables decentralized training on private data, yet extending it to MLLMs is challenging: heterogeneous client modalities induce architectural incompatibility, and full-parameter fine-tuning of billion-scale models incurs prohibitive communication costs. Parameter-efficient methods such as LoRA alleviate these issues but introduce aggregation inconsistency: averaging the low-rank factors fails to faithfully recover the true global update.
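The aggregation inconsistency mentioned above can be seen in a toy NumPy sketch (an illustration, not taken from the paper): averaging each client's LoRA factors $A_i$ and $B_i$ separately and then multiplying them does not, in general, equal the average of the full products $B_i A_i$, because the product is bilinear in the factors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, clients = 8, 6, 2, 3  # hypothetical toy dimensions

# Per-client LoRA factors: update of client i is B_i @ A_i
As = [rng.normal(size=(r, k)) for _ in range(clients)]
Bs = [rng.normal(size=(d, r)) for _ in range(clients)]

# True global update: average of the full low-rank products
true_update = sum(B @ A for B, A in zip(Bs, As)) / clients

# Naive FedAvg-style aggregation: average A and B separately, then multiply
naive_update = (sum(Bs) / clients) @ (sum(As) / clients)

# The two matrices disagree, which is the aggregation inconsistency
err = np.linalg.norm(true_update - naive_update)
print(f"discrepancy (Frobenius norm): {err:.4f}")
```

With random factors the discrepancy is nonzero; it vanishes only in special cases (e.g. a single client, or identical factors across clients).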
To address these issues, we propose UniFLoW (Universal multi-modal Federated LoRA fine-tuning framework With Analytical Aggregation), a unified federated framework that combines pre-trained large models, a multi-modal architecture, and our proposed Federated Aggregating Analytical Low-Rank Adaptation ($FedA^2$-$LoRA$). UniFLoW effectively utilizes fragmented client-side multi-modal data while ensuring consistent aggregation. Modality-specific encoders and a two-stage training strategy ensure effective integration of diverse modalities without overfitting.
Experiments on text, image, and speech tasks demonstrate that \textbf{UniFLoW} enables scalable, communication-efficient, and aggregation-consistent federated fine-tuning, with $FedA^2$-$LoRA$ achieving state-of-the-art performance compared to existing FedLoRA approaches. We envision \textbf{UniFLoW} as a promising solution to the growing scarcity of public data.
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 948