TL;DR: In this paper, we propose FedDDA, a decoupling-based method for Personalized Federated Parameter-Efficient Fine-Tuning.
Abstract: Federated Parameter-Efficient Fine-Tuning aims to adapt Vision-Language Models to downstream tasks in distributed environments. However, data heterogeneity across participants hinders collaborative effectiveness, necessitating personalized adaptation that accommodates distinct data distributions. Current personalized methods suffer from two limitations. 1) Textual Property Loss: existing methods enable collaboration between decoupled prompts only at the feature level, which can undermine the textual properties of the prompts. 2) Visual Feature Diversity: the diversity of visual features makes it difficult to use raw image features directly for image-text alignment in downstream tasks. In this work, we propose Federated Disentangled Tuning with Textual Prior Decoupling and Visual Dynamic Adaptation (FedDDA) to overcome these limitations. Specifically, we decouple prompts in a way that maximizes the efficacy of prior knowledge, which is essential for maintaining a coherent linguistic context. Furthermore, we design a visual adaptation model that reshapes the visual space to align optimally with the textual space. Extensive experiments on various image classification tasks demonstrate the effectiveness of our method in addressing data heterogeneity. The code is released at https://github.com/MoratalYang/FedDDA.
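The prompt-decoupling idea can be illustrated with a minimal sketch. All module names, tensor shapes, and the stand-in frozen text encoder below are assumptions made for illustration only; they are not the released FedDDA implementation. The sketch shows a globally shared prompt and a client-local prompt concatenated with class-name token embeddings (the textual prior) before a frozen text encoder, so that collaboration happens in token space rather than at the output-feature level.

```python
# Minimal sketch (assumed shapes and module names; not the authors' released code):
# decoupled global/local prompt tokens prepended to class-name embeddings
# before a frozen text encoder.
import torch
import torch.nn as nn

class DecoupledPromptLearner(nn.Module):
    def __init__(self, ctx_dim=512, n_global=4, n_local=4):
        super().__init__()
        # Global prompt: uploaded to the server and aggregated across clients.
        self.global_ctx = nn.Parameter(torch.randn(n_global, ctx_dim) * 0.02)
        # Local prompt: kept on-device to personalize to the client's distribution.
        self.local_ctx = nn.Parameter(torch.randn(n_local, ctx_dim) * 0.02)

    def forward(self, class_token_embeds):
        # class_token_embeds: [n_classes, n_tokens, ctx_dim], e.g. embeddings of the
        # tokenized class names (the "textual prior" from the pre-trained model).
        n_classes = class_token_embeds.size(0)
        g = self.global_ctx.unsqueeze(0).expand(n_classes, -1, -1)
        l = self.local_ctx.unsqueeze(0).expand(n_classes, -1, -1)
        # Concatenating learnable context with class-name embeddings keeps the
        # prompts in a coherent linguistic context instead of mixing them at the
        # output-feature level.
        return torch.cat([g, l, class_token_embeds], dim=1)

# Example usage with a frozen stand-in text encoder (a real system would use CLIP's).
prompt_learner = DecoupledPromptLearner()
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=2
)
for p in text_encoder.parameters():
    p.requires_grad_(False)  # only the prompt tokens are trainable

class_embeds = torch.randn(10, 8, 512)              # 10 classes, 8 name tokens each
prompts = prompt_learner(class_embeds)              # [10, 4 + 4 + 8, 512]
text_features = text_encoder(prompts).mean(dim=1)   # pooled per-class text features
```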
Lay Summary: Federated Parameter-Efficient Fine-Tuning facilitates collaborative and privacy-preserving fine-tuning of Vision-Language Models by updating only a limited number of parameters. However, the presence of non-independent and identically distributed (non-IID) data across clients often leads to suboptimal performance.
To address this challenge, we propose FedDDA, a simple yet effective method that leverages both textual and visual modalities. Specifically, our approach integrates global and local prompts to maintain semantic consistency during shared learning. It also dynamically adapts visual representations from images so that they align better with their associated textual descriptions. Together, these mechanisms significantly improve how models learn from distributed, heterogeneous data.
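For the visual side, a minimal sketch of the general idea is given below: a lightweight residual adapter that nudges frozen image features toward the text space before computing CLIP-style similarity logits. The module name, bottleneck design, and residual ratio are assumptions for illustration, not the paper's exact visual dynamic adaptation model.

```python
# Minimal sketch (assumed module; not the authors' exact design): a lightweight
# adapter that reshapes frozen image features toward the text space via a
# residual bottleneck.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAdapter(nn.Module):
    def __init__(self, dim=512, reduction=4, ratio=0.2):
        super().__init__()
        self.ratio = ratio  # how much of the adapted feature replaces the original
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
        )

    def forward(self, image_features):
        adapted = self.net(image_features)
        # Residual mix keeps the pre-trained visual representation while
        # nudging it toward the learned textual space.
        return self.ratio * adapted + (1 - self.ratio) * image_features

# Example: cosine-similarity logits between adapted image features and
# per-class text features, as in CLIP-style classification.
adapter = VisualAdapter()
image_features = torch.randn(32, 512)   # frozen image-encoder outputs (dummy)
text_features = torch.randn(10, 512)    # per-class text features (e.g. from the prompts above)
img = F.normalize(adapter(image_features), dim=-1)
txt = F.normalize(text_features, dim=-1)
logits = 100.0 * img @ txt.t()          # [32, 10] class logits
```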
Primary Area: Social Aspects->Privacy
Keywords: Federated Learning, Parameter-Efficient Fine-Tuning, Vision-Language Model, Data Heterogeneity
Submission Number: 8659