# Decoupled Training with Local Reinforcement Fine-Tuning in Federated Learning


> Federated Learning (FL) with pre-trained Vision-Language Models (VLMs) has emerged as a promising paradigm for various downstream tasks. By leveraging the strong representations of VLMs, recent studies improve task adaptation under insufficient local data while preserving generalization. However, these methods emphasize fully local optimization with simple parameter aggregation, which can amplify inter-client optimization inconsistency and intra-client over-specialization under heterogeneous and full-data FL settings, making it difficult to balance global task adaptation and generalization. To address these challenges, we propose FedDTL, a novel federated VLM framework that decouples the image encoder and text encoder across clients and the server. Through decoupled encoder training with server-client modality alignment, FedDTL promotes coherent global semantic updates and reduces inter-client optimization inconsistency, improving global task adaptation. To further mitigate intra-client over-specialization, we introduce a two-stage local fine-tuning scheme, where a supervised fine-tuning stage enables a rapid and reliable warm start, followed by a reinforcement learning stage that enhances generalization. Extensive experiments on multiple benchmarks, including label skew and feature shift, demonstrate that FedDTL achieves an effective balance between global task adaptation and generalization under various FL data distributions in both few-shot and full-data regimes.

The framework of FedDTL is shown below.

![The framework of FedDTL](FedDTL.png)

## Requirements

```setup
ftfy=6.3.1
gdown=5.2.0
numpy=2.0.1
regex=2024.11.6
spicy=0.16.0
timm=1.0.15
torch=2.5.1
torchvision=0.20.1
tqdm=4.67.1
yacs=0.1.8
```
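To confirm that an environment matches these pins before training, a small check along the following lines may help. This is a minimal sketch using only the standard library; the helper names `parse_pins` and `check_pins` are illustrative and are not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pins(text):
    """Parse lines of the form `name=version` (or `name==version`) into a dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, wanted = line.partition("=")
        pins[name.strip()] = wanted.lstrip("=").strip()
    return pins


def check_pins(pins):
    """Return {name: (installed_version_or_None, matches_pin)} for each pin."""
    report = {}
    for name, wanted in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "+cu121" when comparing versions.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, matches)
    return report


if __name__ == "__main__":
    # A few of the pins listed above; extend the string as needed.
    pins = parse_pins("numpy=2.0.1\ntorch=2.5.1\ntqdm=4.67.1")
    for name, (installed, ok) in check_pins(pins).items():
        status = "ok" if ok else f"expected {pins[name]}"
        print(f"{name}: {installed or 'not installed'} ({status})")
```

Packages reported as `not installed` or with a mismatched version can then be installed at the pinned version.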

## Model Training and Testing

To fine-tune the model with FedDTL, run one of the following commands, depending on the task:

```main
python main_final.py         # b2n class generalization task
python main_domain_final.py  # cross-domain feature shift generalization task
```
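Before running either script, complete the following steps.

Step 1: Download the pre-trained model parameters and datasets (see "Datasets and Pre-trained Models").

Step 2: Set appropriate hyperparameters.

Step 3: Select the data setting (see "Experimental Setting").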

> `example.sh` contains example commands.

## Datasets and Pre-trained Models

### Dataset

- The benchmark datasets can be downloaded following the official links.
  - [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz)

  - [CIFAR100](https://www.cs.toronto.edu/~kriz/cifar.html)

  - [EuroSAT](https://github.com/phelber/eurosat?tab=readme-ov-file)

  - [OxfordPet](https://www.robots.ox.ac.uk/~vgg/data/pets/)

  - [Flower102](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/)

  - [Caltech101](https://s3.us-west-2.amazonaws.com/caltechdata/47/20/fc77-d78a-4c50-81c9-d47c2004df45/data?response-content-type=application%2Foctet-stream&response-content-disposition=attachment%3B%20filename%3Dcaltech-101.zip&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIARCVIVNNAP7NNDVEA%2F20240623%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20240623T042926Z&X-Amz-Expires=60&X-Amz-SignedHeaders=host&X-Amz-Signature=28852945e5e166b6dbed5bf841c8ab5a760b8e58b72ba0a41fc2871c65e777e9)

  - [Caltech256](https://www.kaggle.com/datasets/jessicali9530/caltech256)

  - [Tiny-ImageNet](https://www.kaggle.com/datasets/akash2sharma/tiny-imagenet)

  - [Food101](https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/)

  - [Office-Caltech10](https://cas-bridge.xethub.hf.co/xet-bridge-us/6913f8c085f8cc585ec76c88/ec2fe67efb671dc06f2c6642b7d04281e56ed2529c73efc6a79659f458bd56e2?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=cas%2F20251212%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20251212T145500Z&X-Amz-Expires=3600&X-Amz-Signature=cf62404e1464278bb30264207fcbbf48910c4f5cbe7b96e3f8eda7ed861f0f8e&X-Amz-SignedHeaders=host&X-Xet-Cas-Uid=public&response-content-disposition=inline%3B+filename*%3DUTF-8%27%27office_caltech_10_dataset.zip%3B+filename%3D%22office_caltech_10_dataset.zip%22%3B&response-content-type=application%2Fzip&x-id=GetObject&Expires=1765554900&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTc2NTU1NDkwMH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2FzLWJyaWRnZS54ZXRodWIuaGYuY28veGV0LWJyaWRnZS11cy82OTEzZjhjMDg1ZjhjYzU4NWVjNzZjODgvZWMyZmU2N2VmYjY3MWRjMDZmMmM2NjQyYjdkMDQyODFlNTZlZDI1MjljNzNlZmM2YTc5NjU5ZjQ1OGJkNTZlMioifV19&Signature=Tl3Ojk%7E4U1kFExCIzNoQtn9JGpfx4sRAGqDo-oA7yvZlGLf2fWDA287g12rExjGU7Qeewq6iqgp-7wwmZWHOMNq96yISkdeIcPrrHZr8xmu762U0aA6XivNGbf3Bn%7Enb0lPcUmBRO1nUEbIq5f0Q7IAohL6DwAnBrEHmsqaX2BQHddYhV9iUGSvNpVKTTFKORbHfvGZER8EINHtUg9anGRIKJF0eSsuMTy7kCEqWJN5BiGoPTFjeMurHgUeLgXQz0wKZ6IqVS8l7MV5yU6yn6VsFMIFsAEb%7EgNN05NijYZ-Q78SqU2tcXcwyoa82rH2j6wrLvBZ398bP-g9QTMSYlA__&Key-Pair-Id=K2L8F4GPSG1IFC)

  - [DomainNet](http://ai.bu.edu/DomainNet/)

> Flower102 needs additional annotation files, which we provide in `./dataset/flower102`.

> For some datasets (e.g., OxfordPet, Tiny-ImageNet), the annotations must be downloaded in addition to the images. More details on the dataset splits are in `./dataset`.

### Pre-trained Model

- The following table lists the pre-trained checkpoints.

    <table><tbody>
    <!-- START TABLE -->
    <!-- TABLE HEADER -->
    <tr>
    <th valign="bottom">Pre-trained Model</th>
    <th valign="bottom">Link</th>
    </tr>
    <!-- TABLE BODY -->
    <tr>
    <td align="center">CLIP (ViT-B/32)</td>
    <td align="center"><a href="https://huggingface.co/openai/clip-vit-base-patch32">download</a></td>
    </tr>
    <tr>
    <td align="center">CLIP (ViT-B/16)</td>
    <td align="center"><a href="https://huggingface.co/openai/clip-vit-base-patch16">download</a></td>
    </tr>
    <tr>
    <td align="center">CLIP (ViT-L/14)</td>
    <td align="center"><a href="https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt">download</a></td>
    </tr>
    </tbody></table>

## Experimental Setting

### Data Setting
For b2n class generalization tasks, we design three data settings: `IID`, `Dirichlet`, `Non-IID`.
For cross-domain feature shift generalization tasks, we design three data settings: `IID`, `IID-domain`, `Dirichlet-domain`.
Their implementation is in `FedDTL_utils/utils.py`.
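To illustrate what a `Dirichlet` label-skew setting typically means, here is a minimal sketch of a per-class Dirichlet split. It is an assumption about the general technique, not a copy of the code in `FedDTL_utils/utils.py`; the function name `dirichlet_partition` and the parameters `num_clients`, `alpha`, and `seed` are hypothetical. The sketch uses only the standard library, drawing Dirichlet proportions as normalized Gamma samples.

```python
import random
from collections import defaultdict


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet(alpha) proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A Dirichlet draw equals Gamma(alpha, 1) draws normalized to sum to 1.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(gammas) or 1.0  # avoid dividing by zero if every draw underflows
        # Convert cumulative proportions into cut points over this class's samples.
        cuts, running = [0], 0.0
        for g in gammas:
            running += g / total
            cuts.append(min(round(running * len(indices)), len(indices)))
        cuts[-1] = len(indices)  # the last client takes any rounding remainder
        for client, start, end in zip(clients, cuts, cuts[1:]):
            client.extend(indices[start:end])
    return clients


if __name__ == "__main__":
    toy_labels = [i % 10 for i in range(1000)]
    parts = dirichlet_partition(toy_labels, num_clients=5, alpha=0.5)
    print([len(p) for p in parts])
```

A smaller `alpha` concentrates each class on fewer clients (stronger label skew), while a large `alpha` approaches an even, IID-like split.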

- For the `few-shot` regime, set `args.data_num=few-shot` and `args.shot_num=16`.
- For the `full-data` regime, set `args.data_num=all-shot`.
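As an illustration of the few-shot regime, the sketch below keeps at most `shot_num` samples per class. It is a generic example under the assumption that shots are counted per class; whether the repository samples per client or globally is not stated here, and the function name `sample_few_shot` is hypothetical.

```python
import random
from collections import defaultdict


def sample_few_shot(labels, shot_num=16, seed=0):
    """Return sorted indices holding at most `shot_num` samples of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    selected = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Classes with fewer than `shot_num` samples are kept in full.
        k = min(shot_num, len(indices))
        selected.extend(rng.sample(indices, k))
    return sorted(selected)


if __name__ == "__main__":
    toy_labels = [0] * 40 + [1] * 10 + [2] * 20
    subset = sample_few_shot(toy_labels, shot_num=16)
    print(len(subset))
```

The full-data regime corresponds to skipping this subsampling step and using every index.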


## Baselines

| Baselines  | Original paper                                                                                                                                                   |
|------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `pFedDC`   | [Personalized Federated Learning via Dual-Prompt Optimization and Cross Fusion](https://arxiv.org/abs/2506.21144)                                                    |
| `FedPGP`   | [Harmonizing Generalization and Personalization in Federated Prompt Learning](https://arxiv.org/abs/2405.09771)                                   |
| `pFedMMA`  | [pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models](https://arxiv.org/abs/2507.05394)                                        |
| `PromptFL` | [PromptFL: Let Federated Participants Cooperatively Learn Prompts Instead of Models — Federated Learning in Age of Foundation Model](https://arxiv.org/abs/2208.11625) |
| `FedMaPLe` | [MaPLe: Multi-modal Prompt Learning](https://arxiv.org/abs/2210.03117)                                                                              |


The corresponding code is in `./baselines`.

## Notes
(1) The FedDTL experiments above are based on CLIP with the ViT-B/16 Transformer architecture.

(2) Change `args.IID` to choose a different data setting.

(3) FedMaPLe: FedAvg with MaPLe, a prompt tuning scheme for CLIP.

**(4) IMPORTANT: Make sure the paths to the pre-trained parameters and datasets, the hyperparameters, and the data partition are set correctly.**
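For readers unfamiliar with the FedAvg aggregation mentioned in note (3), the following is a minimal sketch of sample-weighted parameter averaging. It is a generic illustration, not the code in `./baselines`: the function name `fedavg` is hypothetical, and parameters are represented as plain lists of floats rather than tensors so the example has no dependencies.

```python
def fedavg(client_states, client_sizes):
    """Average client parameters, weighting each client by its sample count."""
    if not client_states or len(client_states) != len(client_sizes):
        raise ValueError("need exactly one sample count per client state")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    global_state = {}
    for name, reference in client_states[0].items():
        # Weighted mean of each parameter entry across all clients.
        global_state[name] = [
            sum(state[name][i] * size
                for state, size in zip(client_states, client_sizes)) / total
            for i in range(len(reference))
        ]
    return global_state


if __name__ == "__main__":
    states = [{"prompt": [1.0, 2.0]}, {"prompt": [3.0, 6.0]}]
    print(fedavg(states, client_sizes=[1, 3]))
```

In a prompt-tuning baseline, only the learnable prompt parameters would be placed in each client state, so the frozen CLIP weights never need to be communicated.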
