# CODA: Contrastive Object-centric Diffusion Alignment

> Combining slot attention with pretrained diffusion models has emerged as a promising direction for advancing object-centric learning (OCL) in complex real-world images. Despite early success, our analysis shows that these models struggle to compose images from a single visual concept, limiting generalization to novel scenes and object configurations. To address this limitation, we propose Contrastive Object-centric Diffusion Alignment (CODA), a simple yet effective method that augments the slot sequence with additional register slots. Unlike prior approaches that rely solely on a denoising objective, where compositionality is implicitly imposed through architectural bias, CODA explicitly enforces slot-image alignment via a contrastive loss. Our objective encourages high log-likelihood for slots aligned with the image and penalizes mismatched slots.

[![GitHub](https://img.shields.io/badge/GitHub-000?logo=github&logoColor=white)](???) [![arXiv](https://img.shields.io/badge/arXiv-???logo=arxiv&logoColor=white)](??)

<center>
<img src="./assets/diagram_coda.png"  width="80%">
</center>


## Installation
The training and evaluation code requires PyTorch. Clone the repository, then install the dependencies from `requirements.txt`:
```bash
pip install -r requirements.txt
```


## Data preparation
All datasets are downloaded to and stored in `$USER_DATA`. Set this variable to the target directory, then run the download script with the datasets you need.

```bash
# define where to store data (exported so that child scripts can read it)
export USER_DATA=...
bash preprocess/download.sh voc coco movi-c movi-e
```

## Training
We use the following script for training.
```bash
bash scripts/train.sh <dataset>
```
where `dataset` can be one of [`voc`, `coco`, `movi-c`, `movi-e`].

## Evaluation
We use the following script for evaluation.
```bash
bash scripts/eval.sh <dataset>
```
where `dataset` can be one of [`voc`, `coco`, `movi-c`, `movi-e`].
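
To train and then evaluate several datasets in one go, a small driver around the two scripts above may be convenient. The following is only a minimal sketch: `build_command` and `run_all` are hypothetical helper names that are not part of this repository, and the sketch assumes it is run from the repository root so that `scripts/train.sh` and `scripts/eval.sh` resolve.

```python
import subprocess

# Dataset names accepted by scripts/train.sh and scripts/eval.sh (taken from this README).
DATASETS = ("voc", "coco", "movi-c", "movi-e")
STAGES = ("train", "eval")


def build_command(stage, dataset):
    """Return the argv list for one stage (hypothetical helper, not part of the repo)."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    return ["bash", f"scripts/{stage}.sh", dataset]


def run_all(datasets=DATASETS, dry_run=True):
    """Train, then evaluate, each dataset in order.

    With dry_run=True the commands are only collected, not executed.
    """
    commands = []
    for dataset in datasets:
        for stage in STAGES:
            cmd = build_command(stage, dataset)
            commands.append(cmd)
            if not dry_run:
                # check=True raises CalledProcessError at the first failing script.
                subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Print the planned commands without running them.
    for cmd in run_all(dry_run=True):
        print(" ".join(cmd))
```

Calling `run_all(["voc"], dry_run=False)` would run training and then evaluation for `voc` only. Validating the dataset name up front avoids launching a script with a misspelled argument.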

## Load model
The diffusion pipeline can be loaded as follows.
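```python
import torch

from src.model.pipeline import DiffusionPipeline

image = <image_tensor>  # placeholder: input image tensor, on the same device as the model
model_path = <path_to_pretrained_model>  # placeholder: path to a pretrained model
model = DiffusionPipeline.from_pretrained(model_path).to("cuda")

with torch.no_grad():
    slots = model.encoder(image)  # encode the image into slots
    image_rec = model.sample(slots, resolution=512)  # reconstruct the image from the slots
```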

## Pretrained models

<table style="margin: auto">
<thead>
  <tr>
    <th>Dataset</th>
    <th>Download</th>
  </tr>
</thead>

<tbody>
  <tr>
    <td>MOVi-C</td>
    <td><a href="?">Link</a></td>
  </tr>
  <tr>
    <td>MOVi-E</td>
    <td><a href="?">Link</a></td>
  </tr>
  <tr>
    <td>VOC</td>
    <td><a href="?">Link</a></td>
  </tr>
  <tr>
    <td>COCO</td>
    <td><a href="?">Link</a></td>
  </tr>
</tbody>
</table>


## License

CODA is released under the Apache License 2.0. See the LICENSE file for more details.