### Introduction

This repository implements the training procedure introduced in "Reversed Stable Diffusion: What prompt was
used to generate this image?" for the image-to-text-embedding task.

### Prerequisites

This code expects a data set of image and text pairs, stored as follows:
```bash
|root_dir
  |images_part1
    |images
      |000000.png
  ...
  |images_part8
  |sentence_embeddings
    000000.npy
  metadata.csv
```
where ```sentence_embeddings``` is a directory that stores the target embeddings obtained from a sentence transformer.
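
As a quick sanity check of this layout, the sketch below pairs every image with its target embedding. It is only an illustration: the helper name `pair_images_with_embeddings` is hypothetical, and it assumes that an embedding shares its file stem with its image (as `000000.png` and `000000.npy` suggest) and that stems are unique across the `images_part*` directories.

```python
from pathlib import Path


def pair_images_with_embeddings(root_dir):
    """Return sorted (image_path, embedding_path) pairs found under root_dir."""
    root = Path(root_dir)
    embedding_dir = root / "sentence_embeddings"
    pairs = []
    # Matches images_part1/images/000000.png, images_part2/images/..., and so on.
    for image_path in sorted(root.glob("images_part*/images/*.png")):
        # Assumption: the embedding file has the same stem as the image file.
        embedding_path = embedding_dir / f"{image_path.stem}.npy"
        if embedding_path.is_file():
            pairs.append((image_path, embedding_path))
    return pairs
```

Images without a matching ```.npy``` file are skipped, so comparing the number of returned pairs with the number of images is a cheap way to spot missing embeddings.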

In addition, the code requires a vocabulary for multi-label classification. The script ```compute_vocab.py``` computes this vocabulary.

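
To illustrate what such a vocabulary can look like, here is a minimal sketch that keeps the most frequent prompt words and turns a prompt into a multi-hot target. This is not the logic of ```compute_vocab.py```: the function names, the tokenization rule, and the ```max_size``` and ```min_count``` parameters are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(prompt):
    """Lowercase a prompt and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", prompt.lower())


def build_vocab(prompts, max_size=1000, min_count=1):
    """Map the most frequent words to class indices (hypothetical helper)."""
    counts = Counter()
    for prompt in prompts:
        # Count each word once per prompt, i.e. its document frequency.
        counts.update(set(tokenize(prompt)))
    kept = [word for word, count in counts.items() if count >= min_count]
    # Sort by frequency, then alphabetically, so the result is deterministic.
    kept.sort(key=lambda word: (-counts[word], word))
    return {word: index for index, word in enumerate(kept[:max_size])}


def multi_hot(prompt, vocab):
    """Encode a prompt as a 0/1 vector with one entry per vocabulary word."""
    target = [0] * len(vocab)
    for word in tokenize(prompt):
        index = vocab.get(word)
        if index is not None:
            target[index] = 1
    return target
```

For example, ```build_vocab(["a red cat", "a blue cat", "a dog"], max_size=2)``` keeps the two most frequent words, and ```multi_hot``` then marks which of them occur in a given prompt.
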
### Train models

Each model comes with two training scripts. The first one runs the vanilla training process, while
the second one runs the curriculum learning procedure.
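
For readers unfamiliar with curriculum learning, the general easy-to-hard idea can be sketched as follows. This is a generic illustration, not the procedure implemented in this repository: the helper ```curriculum_stages```, the ```difficulty``` scoring function, and the equal-sized stage schedule are assumptions.

```python
def curriculum_stages(samples, difficulty, num_stages=3):
    """Return growing training subsets, ordered from easy to hard.

    `difficulty` is a hypothetical scoring function: lower means easier.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(samples, key=difficulty)
    stages = []
    for stage in range(1, num_stages + 1):
        # Each stage exposes a larger prefix of the easy-to-hard ordering.
        cutoff = max(1, len(ordered) * stage // num_stages)
        stages.append(ordered[:cutoff])
    return stages
```

A training loop would then run one or more epochs on each returned subset in turn, so the last stage covers the full data set.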

The U-Net is a special case: it expects the captions and operates in the latent space of Stable Diffusion (SD), so it
requires a preliminary step that maps the images into this latent space (not included in the repo).
