# SoftREPA: Aligning Text to Image in Diffusion Models is Easier Than You Think

![img](assets/main-figure.jpg)

## Abstract

❗️ SoftREPA is a lightweight fine-tuning method that improves text-image alignment in text-to-image (T2I) generative models using soft text tokens, adding fewer than 1M trainable parameters.

💡 SoftREPA adopts a contrastive learning approach using denoising score matching logits with both positive and negative pairs to better align image and text representations.

📌 Experiments across generation and editing tasks validate SoftREPA’s effectiveness and efficiency.
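
The contrastive idea above can be illustrated with a minimal sketch. It is not the repository's implementation: it assumes a hypothetical matrix `errors`, where `errors[i][j]` is the denoising error of image `i` conditioned on text `j`, and a hypothetical `temperature` parameter. Negated errors act as logits, and matched pairs (the diagonal) are treated as positives.

```python
import math

def contrastive_loss_from_errors(errors, temperature=1.0):
    """Cross-entropy over negated denoising errors (illustrative only).

    errors[i][j] is assumed to be the denoising error for image i when
    conditioned on text j; a lower error means a better match.
    """
    total = 0.0
    for i, row in enumerate(errors):
        # Lower error -> higher logit.
        logits = [-e / temperature for e in row]
        # Numerically stable log-sum-exp over all texts for this image.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive (matched) text.
        total += log_z - logits[i]
    return total / len(errors)

# Matched pairs with low error give a smaller loss than mismatched ones.
aligned = [[0.1, 2.0], [2.0, 0.1]]
misaligned = [[2.0, 0.1], [0.1, 2.0]]
print(contrastive_loss_from_errors(aligned) < contrastive_loss_from_errors(misaligned))
```

Minimizing this loss pushes the denoising error down for matched image-text pairs and up for mismatched ones, which is the intuition behind using both positive and negative pairs.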


## Quick Start

### Environment Setup

First, clone this repository and install the requirements.

```
conda create -n softrepa python=3.10
conda activate softrepa
pip install -r requirements.txt
```


## Training and Evaluation
### Training

- Run `run_train.sh`. Uncomment the block for the `--model` you want to train (`sd3`, `sd1.5`, or `sdxl`) and adjust the arguments as needed: `--dweight` sets the weight of the diffusion loss that is combined with the contrastive loss, `--n_dc_tokens` sets the number of soft text tokens per layer, `--n_dc_layers` or `--apply_dc` selects the layers to which tokens are prepended, and `--use_dc_t` makes the tokens depend on t.
- Set `LOGDIR` to the directory where checkpoints are saved (default: `./data`) and `DATADIR` to the path of the text-image pair dataset (COCO).
- Two GPUs are required: one for text and image encoding, and the other for denoising.
- The default settings for each model are defined in `run_train.sh`.
- With this setup, `sd3` trains at a `BATCHSIZE` of 4 (16 batches for contrastive learning) on two GPUs with 40GB of VRAM each.
```
sh run_train.sh
```
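
As a rough reading of the settings above, the sketch below shows how the batch figures and `--dweight` might fit together. Both parts are assumptions rather than the repository's code: the quoted figures (4 → 16, 16 → 256) are consistent with pairing every image with every text in a batch, and the weighted sum is only one plausible way to combine the two losses. The function names are hypothetical.

```python
def num_contrastive_pairings(batch_size):
    # Assumption: every image in the batch is scored against every text.
    return batch_size * batch_size

def combined_loss(contrastive, diffusion, dweight):
    # Assumption: the diffusion loss is added to the contrastive loss,
    # scaled by the value passed as --dweight.
    return contrastive + dweight * diffusion

print(num_contrastive_pairings(4))   # 16
print(num_contrastive_pairings(16))  # 256
```

Under this reading, setting `--dweight` to 0 would leave only the contrastive term.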

### Training with Large Batches
- For distributed training, all text and image encodings are precomputed and saved. Make sure you have enough disk space for these encodings before running `run_train_dist.sh`. For the COCO dataset, `sd3` requires 9.1 TB, `sdxl` requires 1.4 TB, and `sd1.5` requires 1.1 TB.
- With this setup, `sd3` trains at a `BATCHSIZE` of 16 (256 batches for contrastive learning) on two GPUs with 40GB of VRAM each.
```
sh run_train_dist.sh
```
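
Because the precomputed encodings are large, it may help to check free disk space before launching. The helper below is a small sketch, not part of the repository: `REQUIRED_TB` copies the COCO sizes quoted above, the function name is hypothetical, and the sizes are assumed to be decimal terabytes (10^12 bytes).

```python
import shutil

# COCO encoding sizes quoted in this README, assumed to be decimal terabytes.
REQUIRED_TB = {"sd3": 9.1, "sdxl": 1.4, "sd1.5": 1.1}

def has_enough_space(model, free_bytes):
    """Return True if free_bytes covers the quoted encoding size for model."""
    required_bytes = REQUIRED_TB[model] * 1000**4
    return free_bytes >= required_bytes

if __name__ == "__main__":
    # Check the current directory; pass the actual encoding path instead.
    free = shutil.disk_usage(".").free
    for model in REQUIRED_TB:
        status = "ok" if has_enough_space(model, free) else "not enough space"
        print(f"{model}: {status}")
```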

### Sampling & Evaluation on T2I Generation
- Set `NFE`, `CFG`, and `IMGSIZE` for sampling, and set `MODEL`, `USEDCT`, `NTOKENS`, and `NLAYERS` to match the training settings. `LOADDIR` is the directory containing the saved tokens. The trained tokens used in the paper are provided in `tokens/$MODEL`.
- The default settings for each model are defined in `run_eval.sh`. To evaluate other models, uncomment the corresponding lines in that file.
```
sh run_eval.sh
```