Taming Vision-Language Models for Federated Foundation Models on Heterogeneous Medical Imaging Modalities

Published: 01 Jan 2025, Last Modified: 23 Sept 2025 · ICMR 2025 · CC BY-SA 4.0
Abstract: Training federated foundation models (FFMs) for sensitive medical images presents open challenges due to complex data heterogeneity. Related studies focus on the problem within a single imaging modality; however, these methods lack the flexibility required to generalize related tasks uniformly across different imaging modalities. This paper proposes FFMed, a federated learning framework that tames pretrained Vision-Language Models for FFM training, targeting medical image classification across heterogeneous medical imaging modalities. Specifically, FFMed improves CLIP's medical image-text alignment through Adaptive Prompt Generation, which introduces task- and domain-specific, informative prompts in conjunction with low-rank adaptation. To mitigate learning bias arising from imaging-modality heterogeneity across clients, we propose Anchor-based Dynamic Regularization, which dynamically constrains local optimization to remain close to the global stationary point, thereby promoting optimal global consensus. Ultimately, FFMed fosters a unified model that generalizes effectively across diverse non-IID environments. Extensive experiments on real-world medical image datasets demonstrate the effectiveness and superiority of FFMed.
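The abstract does not give the exact form of the Anchor-based Dynamic Regularization. As a rough, non-authoritative sketch only, a proximal-style penalty that keeps a client's local parameters close to a global anchor during local optimization might look like the following; the names `mu`, `local_params`, and `anchor_params` are hypothetical and not taken from the paper.

```python
# Illustrative sketch (assumed form): local task loss plus a proximal term
# pulling client parameters toward a fixed global anchor during local updates.
import torch

def regularized_local_loss(task_loss, local_params, anchor_params, mu=0.01):
    """task_loss: scalar tensor from the client's classification objective.
    local_params / anchor_params: matching iterables of parameter tensors.
    mu: strength of the anchor penalty (hypothetical hyperparameter)."""
    prox = sum(((p - a.detach()) ** 2).sum()
               for p, a in zip(local_params, anchor_params))
    return task_loss + 0.5 * mu * prox
```

In this sketch the anchor is held fixed (detached) within a round, so gradients only flow through the local parameters; how FFMed actually selects or updates its anchor is specified in the paper, not here.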