# Text-to-Image 

Text-to-image experiments using Flow Matching and the MMDiT architecture. Several training techniques are compared to see whether they improve sample quality and training speed: vanilla training, REPA [1], REPA-E [2] and HASTE [3]. Classifier-free guidance is also evaluated. 

### Data  
The experiments use two datasets: MSCOCO 2017, with around 118k training images, and CC3M, with around 2.9M image-text pairs. 

### Models  
All experiments use the MMDiT architecture. Variations of its architecture hyperparameters may also be evaluated. 

### Techniques   
The training techniques evaluated here are the ones mentioned in the introduction. 

* **Vanilla Training**: Standard training pipeline in which noisy latents and the text embedding of the prompt are fed to the model, and the loss is the MSE between the predicted and the true velocity field. Assume we have a dataset $\mathcal{D}$ of image-prompt pairs $(X_1, c)$. The probability path is the Conditional Optimal Transport path $X_t = t X_1 + (1-t) X_0$, where $X_0\sim p_0 = \mathcal{N}(0,I)$ and $X_1\sim p_{data}$. Differentiating $X_t$ with respect to $t$ gives the target velocity $X_1 - X_0$, so the loss function is $$\mathcal{L}_{Vanilla}(\theta) = \mathbb{E}_{t,X_0,(X_1,c)}\lVert u_t^{\theta}(X_t, c) - (X_1-X_0)\rVert^2$$     

* **REPA** [1]: This technique aligns patch-wise projections of the model’s hidden states with pretrained self-supervised visual representations. Alignment is achieved by maximizing the patch-wise similarity between the pretrained representation $y_*$ and the hidden state $h_t$: $$\mathcal{L}_{REPA}(\theta,\phi) = \mathcal{L}_{Vanilla}(\theta) - \mathbb{E}_{x_*,\epsilon,t}\left[\frac{1}{N}\sum_{n=1}^N sim(y_*^{[n]}, h_{\phi}(h_t^{[n]}))\right]$$ where $N$ is the number of patches, $n$ is the patch index, $sim(\cdot,\cdot)$ is a similarity function such as cosine similarity, and $h_{\phi}$ is a projection head that maps the hidden state $h_t$ to the dimension of the pretrained encoder features $y_*$. Here $x_*$ and $\epsilon$ denote the clean image and the noise, which correspond to the data sample $X_1$ and the noise $X_0$ above. 

* **REPA-E**:   
* **HASTE**:   
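
The two losses written out above can be sketched in a few lines of dependency-free Python. This is a toy illustration on plain lists, not the training code of this repository: the helper names, the zero-velocity stand-in for the MMDiT output, the uniform sampling of $t$, the use of cosine similarity for $sim$ and the single linear map used as projection head $h_{\phi}$ are all assumptions made for the example.

```python
import math
import random


def flow_matching_pair(x1, t, rng):
    """Return (x_t, target velocity) for one data sample x1 at time t in [0, 1]."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                 # X_0 ~ N(0, I)
    xt = [t * a + (1.0 - t) * b for a, b in zip(x1, x0)]   # X_t = t X_1 + (1 - t) X_0
    target = [a - b for a, b in zip(x1, x0)]               # d/dt X_t = X_1 - X_0
    return xt, target


def squared_error(pred, target):
    """Squared L2 norm ||pred - target||^2 for a single sample."""
    return sum((p - q) ** 2 for p, q in zip(pred, target))


def cosine_similarity(u, v):
    """Cosine similarity, used here as the (assumed) choice of sim."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def project(h, weight):
    """Hypothetical linear projection head: hidden dimension -> encoder dimension."""
    return [sum(w * x for w, x in zip(row, h)) for row in weight]


def repa_alignment_term(y_star, h_t, weight):
    """Return -(1/N) * sum_n sim(y_*[n], h_phi(h_t[n])) over the N patches."""
    sims = [cosine_similarity(y, project(h, weight)) for y, h in zip(y_star, h_t)]
    return -sum(sims) / len(sims)


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0, 0.0]          # toy "latent" of one image
t = rng.random()                    # assumption: t sampled uniformly in [0, 1)
xt, target = flow_matching_pair(x1, t, rng)
pred = [0.0] * len(xt)              # stand-in for the model output u_t(X_t, c)
vanilla_loss = squared_error(pred, target)

y_star = [[1.0, 0.0], [0.0, 2.0]]               # encoder features, N = 2 patches
h_t = [[3.0, 0.0, 5.0], [0.0, 1.0, -1.0]]       # hidden states, one per patch
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # toy projection: keep first two dims
repa_loss = vanilla_loss + repa_alignment_term(y_star, h_t, weight)
```

In this toy setup the projected hidden states point in the same direction as the encoder features, so the alignment term reaches its minimum and lowers the total loss. In real training the expectation is approximated by averaging these per-sample values over a mini-batch.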

### Experiments with MSCOCO 2017 
| Model | Technique | Encoder | Steps - Epochs - BatchSize | Status | COCO Val 2014 Results | Path (*) |
|-----------|-------------|--------|--------|--------|--------|--------|
| MMDiT | HASTE (Attn_Coef: 0.5 / Proj_Coef: 0.5 / Stop: 150k) | DINOv2-B | 400k - 870 - 256 | ✅ Done | FID: 4.60 / ClipScore: 17.14 | MMDiT-HASTE-Orig-CFG-COCO |  
| MMDiT | REPA (Proj_Coef: 0.5) | DINOv2-B | 400k - 870 - 256 | ✅ Done | FID: 4.51 / ClipScore: 17.14 | MMDiT-REPA-Orig-CFG-COCO | 
| MMDiT | Vanilla | DINOv2-B | 400k - 870 - 256 | ✅ Done | FID: 4.44 / ClipScore: 17.16 | MMDiT-Vanilla-Orig-CFG-COCO |  

(*) The base path of the models is `/gpfs/projects/bsc70/bsc193242/t2i_models/`   

### Experiments with CC3M  
| Model | Technique | Encoder | Steps - Epochs - BatchSize | Status | COCO Val 2014 Results | Path (*) |
|-----------|-------------|--------|--------|--------|--------|--------|
| MMDiT | Vanilla | DINOv2-B | 400k - 35 - 256 | ✅ Done | FID: 22.47 / ClipScore: 17.21 | MMDiT-Vanilla-CFG-CC3M |
| MMDiT | REPA (Proj_Coef: 0.5) | DINOv2-B | 400k - 35 - 256 | ✅ Done | FID: 22.21 / ClipScore: 17.21 | MMDiT-REPA-CFG-CC3M |
| MMDiT | HASTE (Attn_Coef: 0.5 / Proj_Coef: 0.5 / Stop: 250k) | DINOv2-B | 400k - 35 - 256 | ✅ Done | FID: 22.12 / ClipScore: 17.21 | MMDiT-HASTE-CFG-CC3M |

(*) The base path of the models is `/gpfs/projects/bsc70/bsc193242/t2i_models/`    
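
As a sanity check on the `Steps - Epochs - BatchSize` column, the epoch count can be recovered from the other two values and the dataset size. The sketch below assumes one sample is consumed once per epoch, no gradient accumulation, and the dataset sizes given in the comments; the COCO figure is the size of the full 2017 train split and is an assumption about what was actually used here.

```python
def epochs_from_steps(steps, batch_size, dataset_size):
    """Number of passes over the dataset after `steps` optimizer steps."""
    return steps * batch_size / dataset_size


COCO_2017_TRAIN_IMAGES = 118_287   # assumption: full COCO 2017 train split
CC3M_PAIRS = 2_900_000             # "around 2.9M" pairs, as stated in the Data section

coco_epochs = epochs_from_steps(400_000, 256, COCO_2017_TRAIN_IMAGES)
cc3m_epochs = epochs_from_steps(400_000, 256, CC3M_PAIRS)
print(f"COCO: {coco_epochs:.1f} epochs, CC3M: {cc3m_epochs:.1f} epochs")
```

Under these assumptions the result is roughly 866 epochs for MSCOCO and 35 for CC3M, in line with the rounded values listed in the tables.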

### References  
[1] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, & Saining Xie. (2025). Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think.     
[2] Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, & Liang Zheng. (2025). REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers.   
[3] Ziqiao Wang, Wangbo Zhao, Yuhao Zhou, Zekai Li, Zhiyuan Liang, Mingjia Shi, Xuanlei Zhao, Pengfei Zhou, Kaipeng Zhang, Zhangyang Wang, Kai Wang, & Yang You. (2025). REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training.  