Abstract: Multimodal Large Language Models (MLLMs) excel in tasks like multimodal reasoning and cross-modal retrieval but face deployment challenges in real-world scenarios due to distributed multimodal data and strict privacy requirements. Federated Learning (FL) offers a solution by enabling collaborative model training without centralizing data. However, integrating MLLMs into FL introduces challenges such as high computational demands, limited client capacity, substantial communication costs, and heterogeneous client data. Existing FL methods, which require deploying full models on clients, are impractical in these settings. To address these limitations, we propose **FedNano**, a novel FL framework that centralizes the LLM on the server while introducing NanoEdge, a lightweight module for client-specific adaptation. NanoEdge employs modality-specific encoders, connectors, and trainable NanoAdapters with low-rank adaptation, achieving a **95\% reduction** in client-side model storage and a transmission overhead of just **0.01\%** of model parameters. By transmitting compact updates of NanoAdapters, FedNano effectively handles client heterogeneity and resource constraints, providing a scalable, privacy-preserving solution for MLLM deployment. Experiments show that FedNano outperforms existing methods, bridging the gap between MLLM complexity and FL constraints and enabling efficient, decentralized multimodal AI systems.
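The abstract's communication savings come from low-rank adaptation: only the small adapter factors are trained and transmitted, while the large base weights stay frozen on the server. A minimal sketch of that mechanism (hypothetical dimensions and names; not the authors' implementation — the paper's 0.01% figure depends on the actual model and rank):

```python
import numpy as np

# Hypothetical sketch of a low-rank adapter in the style of a "NanoAdapter".
# The large base weight W is frozen (kept server-side); only the small
# factors A and B are trained and transmitted between client and server.
d, k, r = 1024, 1024, 8  # illustrative dimensions; rank r << min(d, k)

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))          # frozen base weight
A = rng.standard_normal((d, r)) * 0.01   # trainable low-rank factor
B = np.zeros((r, k))                     # zero-init so the update starts at 0

def adapted_forward(x):
    # y = x W + x A B : the low-rank term (x A) B is the only trained part
    return x @ W + (x @ A) @ B

# Only A and B would be communicated in a federated round.
full_params = W.size
adapter_params = A.size + B.size
print(f"adapter share of parameters: {adapter_params / full_params:.4%}")
```

Because `B` is zero-initialized, the adapter starts as an exact no-op on the base model; the transmitted payload per round is `A.size + B.size` floats instead of the full weight matrix.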
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: federated learning, multimodal learning, vision question answering, multimodal QA
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 2234