# Data Provenance in IARS - Infinity

## Installation

### Environment

Set up the environment by installing the dependencies listed in `requirements.txt`.

All experiments were conducted with GPT-L (`c2i_L_256`) and the VQ-16 VAE (`vq_ds16_c2i`). These model weights (and others) can be downloaded from Hugging Face: https://huggingface.co/FoundationVision/LlamaGen/
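The download can be scripted. The following is a minimal sketch using only the Python standard library. It assumes the checkpoint files are named after the identifiers above with a `.pt` extension and live on the repository's `main` branch. The file names, the helper functions, and the `./pretrained_models` target directory are illustrative assumptions, so check them against the file list on Hugging Face before use.

```python
from pathlib import Path
from urllib.request import urlretrieve

REPO_URL = "https://huggingface.co/FoundationVision/LlamaGen"

# Assumed file names (identifier + ".pt"); verify them in the repository file list.
CHECKPOINTS = {"gpt": "c2i_L_256.pt", "vae": "vq_ds16_c2i.pt"}


def checkpoint_url(filename: str) -> str:
    """Build the direct-download URL for a file on the repository's main branch."""
    return f"{REPO_URL}/resolve/main/{filename}"


def download_checkpoints(target_dir: str = "./pretrained_models") -> dict:
    """Download each checkpoint that is not already present and return its local path."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = {}
    for key, filename in CHECKPOINTS.items():
        destination = target / filename
        if not destination.exists():
            # The files are large; this call blocks until the transfer completes.
            urlretrieve(checkpoint_url(filename), str(destination))
        paths[key] = destination
    return paths
```

Calling `download_checkpoints()` returns the local paths, which can then be passed to `--gpt-ckpt` and `--vq-ckpt` in the commands below.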

If required, further instructions can be obtained from the [official LlamaGen repository](https://github.com/FoundationVision/LlamaGen).

## Files & Execution

Our main scripts are `codebook.py`, `finetune_vae.py`, `latenttracer.py`, and `autoregressive/sample/sample_c2i.py`.

### Data Generation

To generate data and prepare it for finetuning, execute `autoregressive/sample/sample_c2i.py`:

```bash
PYTHONPATH=. python3 ./autoregressive/sample/sample_c2i.py \
    --vq-ckpt /path/to/vae \
    --gpt-model GPT-L \
    --vq-model VQ-16 \
    --gpt-ckpt /path/to/GPT \
    --image-size 256 \
    --save_folder ./samples \
    --num_samples 10  # number of samples generated per class
```

### Finetuning

For finetuning, execute `finetune_vae.py`:
```bash
PYTHONPATH=. python3 finetune_vae.py \
    --preprocessed-dir ./samples \
    --vq-model VQ-16 \
    --vq-ckpt /path/to/vae \
    --codebook-size 16384 \
    --codebook-embed-dim 8 \
    --num-workers 16 \
    --finetune-lr 1e-5 \
    --epochs 50 \
    --batch-size 8 \
    --grad-clip 0 \
    --seed 42 \
    --save-interval 1 \
    --output-dir ./checkpoints \
    --log-file ./training_loss.log
```


### Data Provenance - Losses Calculation

To calculate our losses and baselines, execute `codebook.py`:
```bash
PYTHONPATH=. python3 codebook.py \
    --vq-model VQ-16 \
    --vq-ckpt /path/to/vae \
    --codebook-size 16384 \
    --codebook-embed-dim 8 \
    --seed 42 \
    --ft_path /path/to/inverse_decoder \
    --dataset_config ./dataset_config.json \
    --num_samples 1000
```

### LatentTracer Baseline

To calculate results with LatentTracer, execute `latenttracer.py`:
```bash
PYTHONPATH=. python3 latenttracer.py \
    --vq-model VQ-16 \
    --vq-ckpt /path/to/vae \
    --codebook-size 16384 \
    --codebook-embed-dim 8 \
    --seed 42 \
    --num_samples 1000 \
    --batch_size 2 \
    --save_folder path/to/folder \
    --dataset_config ./dataset_config.json
```

### Setup Datasets

Datasets are passed via a separate `dataset_config.json` file that maps each dataset name to its path:
```json
{
    "Name": "Path to dataset"
}
```
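
Before starting a long run, it can help to check that the file has the expected shape and that every path exists. The sketch below is one way to do that. The helper names `load_dataset_config` and `missing_datasets` are hypothetical and not part of this repository, and it assumes each value is a plain string path; how the scripts themselves consume the file is not shown here.

```python
import json
from pathlib import Path


def load_dataset_config(config_path):
    """Read a JSON object mapping dataset names to paths and check its basic shape."""
    with open(config_path, encoding="utf-8") as handle:
        config = json.load(handle)
    if not isinstance(config, dict):
        raise ValueError("dataset config must be a JSON object")
    for name, path in config.items():
        if not isinstance(path, str):
            raise ValueError(f"path for dataset {name!r} must be a string")
    return {name: Path(path) for name, path in config.items()}


def missing_datasets(datasets):
    """Return the names of datasets whose paths do not exist on disk."""
    return [name for name, path in datasets.items() if not path.exists()]
```

For example, `missing_datasets(load_dataset_config("./dataset_config.json"))` returns an empty list when every configured path is present.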

## Acknowledgements

The fine-tuning code is partly derived from the official implementation of [IndexMark](https://github.com/maifoundations/IndexMark).

