<div align="center">
<h1>UniTok: A Unified Tokenizer <br> for Visual Generation and Understanding</h1>
</div>

This repo implements UniTok, a unified visual tokenizer well-suited for both generation and understanding tasks.
It is compatible with autoregressive generative models (e.g., LlamaGen),
multimodal understanding models (e.g., LLaVA), and unified MLLMs (e.g., Chameleon and Liquid).

![teaser](assets/teaser.png)

Using the [Liquid](https://github.com/FoundationVision/Liquid/) framework,
we build an MLLM on top of UniTok that handles both multimodal generation and understanding
and sets a new state of the art among unified autoregressive MLLMs.

![samples](assets/samples.png)


## Usage

### Requirements
- Python ≥ 3.10
- PyTorch ≥ 2.3.1

### Installation

```bash
pip install -r requirements.txt
```
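
After installing, it can be useful to confirm that the environment meets the minimum versions listed under Requirements. The following is a minimal sketch using only the standard library; the helper names `parse_version` and `check_environment` are hypothetical and not part of this repo, and the check assumes PyTorch is installed under its usual distribution name `torch`.

```python
import re
import sys
from importlib import metadata

MIN_PYTHON = (3, 10)    # from the Requirements list
MIN_TORCH = (2, 3, 1)   # from the Requirements list


def parse_version(text):
    """Turn a string such as '2.3.1+cu121' into the tuple (2, 3, 1)."""
    core = text.split("+")[0]  # drop local build tags like '+cu121'
    numbers = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)  # pad '2.4' to (2, 4, 0)
    return tuple(numbers)


def check_environment():
    """Return a list of problems; an empty list means the minimums are met."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python {found} is older than 3.10")
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        problems.append("PyTorch is not installed")
    else:
        if parse_version(torch_version) < MIN_TORCH:
            problems.append(f"PyTorch {torch_version} is older than 2.3.1")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["Environment meets the stated minimums."]:
        print(line)
```

The check reads the installed version from package metadata instead of importing `torch`, so it runs quickly and does not initialize CUDA.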


### Training

- We train UniTok on [DataComp-1B](https://github.com/mlfoundations/datacomp). 
Please follow the [instructions](https://github.com/mlfoundations/datacomp?tab=readme-ov-file#downloading-datacomp-1b) to download and prepare the data.

- Download the external models used for loss calculation and put them under `./external`.

- Download the [ImageNet validation set](https://www.image-net.org/) for zero-shot accuracy evaluation.

- Download the ImageNet 256×256 reference batch for FID evaluation.

Configure `nnodes`, `nproc_per_node`, `node_rank`, `master_addr`, and `master_port` in `launch.sh`, then run:

```bash
bash launch.sh \
    --output_dir '/path/to/save/checkpoints/' \
    --train_data '/path/to/datacomp/shards/{00000000..00140146}.tar' \
    --imagenet_val '/path/to/imagenet_val/' \
    --fid_eval_src '/path/to/imagenet_reference_batch' \
    --fid_eval_dst '/path/to/save/imagenet_reconstructed_batch'
```
**Note:** See `utils/config.py` for the remaining hyperparameter options.
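
Because the `--train_data` value is single-quoted, the shell passes the `{00000000..00140146}` pattern to the script unexpanded. How the training code expands it is not described here, so the sketch below is only a standalone pre-flight check: it expands one numeric brace range the way bash would and reports shards that are missing on disk. The helpers `expand_shards` and `missing_shards` are hypothetical names, not part of this repo.

```python
import re
from pathlib import Path

# Matches a single numeric range such as {00000000..00140146}.
BRACE_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_shards(pattern):
    """Expand one '{start..end}' range into a list of paths, keeping zero padding."""
    match = BRACE_RANGE.search(pattern)
    if match is None:
        return [pattern]  # nothing to expand
    start, end = match.group(1), match.group(2)
    width = max(len(start), len(end))  # pad to the wider bound
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [
        f"{prefix}{index:0{width}d}{suffix}"
        for index in range(int(start), int(end) + 1)
    ]


def missing_shards(pattern):
    """Return the expanded shard paths that do not exist as files."""
    return [path for path in expand_shards(pattern) if not Path(path).is_file()]


if __name__ == "__main__":
    pattern = "/path/to/datacomp/shards/{00000000..00140146}.tar"
    shards = expand_shards(pattern)
    print(f"{len(shards)} shards expected, {len(missing_shards(pattern))} missing")
```

Running this before launching a multi-node job can catch an incomplete download early, which is cheaper than discovering a missing shard partway through training.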

### Evaluation

We benchmark UniTok on understanding performance using the [LLaVA](https://github.com/haotian-liu/LLaVA) framework,
generation performance using the [LlamaGen](https://github.com/FoundationVision/LlamaGen) framework, and unified performance using the [Liquid](https://github.com/FoundationVision/Liquid/) framework.
Please refer to [EVAL.md](eval/EVAL.md) for more details.


## Acknowledgement
UniTok is built upon the awesome works
[VAR](https://github.com/FoundationVision/VAR),
[DataComp](https://github.com/mlfoundations/datacomp),
[Liquid](https://github.com/FoundationVision/Liquid/),
[LLaVA](https://github.com/haotian-liu/LLaVA/),
[LlamaGen](https://github.com/FoundationVision/LlamaGen/),
and [ViTamin](https://github.com/Beckschen/ViTamin).

