## Overview

We propose a novel approach to image generation by decomposing an image into a structured sequence, where each element in the sequence shares the same spatial resolution but differs in the number of unique tokens used, capturing different levels of visual granularity.<br>
Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which generates a visual granularity sequence beginning from an empty image and progressively refines it, from global layout to fine details, in a structured manner. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over the generation process across multiple granularity levels.<br>
We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. NVG consistently outperforms the VAR series in terms of FID scores (3.30 → 3.03, 2.57 → 2.44, 2.09 → 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models will be released.<br>


## Installation

### Create environment for NVG training and sampling

```bash
conda create -n nvg-test python=3.9
conda activate nvg-test
pip install -r requirements.txt
# only run the following command when you face issues related to libgl
# conda install -c conda-forge libgl
```

### Evaluation environment ([OpenAI's evaluation tool](https://github.com/openai/guided-diffusion/tree/main/evaluations))

We recommend creating a separate conda environment named `oai-eva` with the following packages:

```text
python>=3.9
tensorflow-gpu>=2.0
scipy
requests
tqdm
```

## Training and Evaluation

### Notes

We use PyTorch compile mode for faster training and evaluation, at a small extra cost at initialization. Disable it via:
```
USE_TORCH_COMPILE=0
```
We also use Weights & Biases (WandB) to log training. Set your API key in your environment:
```
WANDB_API_KEY
```
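
For example, both variables can be exported once per shell session. This is a minimal sketch: the key value is a placeholder, and `WANDB_MODE=offline` (also used in the training commands in this README) is the WandB setting that keeps logs local instead of uploading them.

```bash
# Disable torch.compile for a quicker start-up, e.g. while debugging.
export USE_TORCH_COMPILE=0
# Placeholder value: replace with your own key.
export WANDB_API_KEY="<your-api-key>"
# Optional: log locally without syncing to the WandB servers.
export WANDB_MODE=offline
```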
The training configs assume a single-machine, single-GPU setting; change this in the config files as needed.

### Auto-Encoder

**Training**

```bash
USE_TORCH_COMPILE=0 WANDB_MODE='offline' torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=distributed_alldata --rdzv_backend=c10d --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT main.py -t True --base configs/downsample_configs/vq-f16-d32.yaml --enable_tf32 True
```
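
The training command reads `$MASTER_ADDR` and `$MASTER_PORT` from the environment to build the rendezvous endpoint. For a single-machine run, one possible setup is the loopback address and any free port. The values here are assumptions for illustration, not taken from the original instructions.

```bash
# Assumed values for a single-machine run; adjust for multi-node clusters.
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=29500
echo "rendezvous endpoint: $MASTER_ADDR:$MASTER_PORT"
```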

**Evaluation**

```bash
USE_TORCH_COMPILE=0 python eval_vq.py --config_file configs/downsample_configs/vq-f16-d32.yaml --ckpt_path ckpt/ae.ckpt
```

---

### Generator

**Training**


Configs for the three variants with different depths are named `d16`, `d20`, and `d24`. Change the config path in the command to select a variant.
```bash
USE_TORCH_COMPILE=0 WANDB_MODE='offline' torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=distributed_alldata --rdzv_backend=c10d --rdzv-endpoint=$MASTER_ADDR:$MASTER_PORT main.py -t True --scale_lr --base configs/generator_configs/d16.yaml
```

**Sampling**

```bash
export BATCH_SIZE=250
export SAMPLES_PER_CLASS=50
export STRUCTURE_SAMPLING_STEP=25
export CONTENT_CFG=1-3.5
export STRUCTURE_CFG=1-2.5
export TOP_P=1-0.5
python eval_gen.py --eval_ema --batch_size $BATCH_SIZE --config_file configs/generator_configs/d16.yaml --ckpt_path ckpt/d16.ckpt --content_cfg_scale=$CONTENT_CFG --structure_cfg_scale=$STRUCTURE_CFG --top_p=$TOP_P --samples_per_class=$SAMPLES_PER_CLASS --structure_sampling_step=$STRUCTURE_SAMPLING_STEP --sample_dir gen_results/d16
```
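
As a sanity check on these settings, the expected number of generated images can be worked out by hand. The sketch assumes that all 1,000 ImageNet classes are sampled and that a single process batches over all samples; `NUM_CLASSES` is a name introduced here for illustration only.

```bash
NUM_CLASSES=1000        # assumption: ImageNet-1k, every class sampled
SAMPLES_PER_CLASS=50
BATCH_SIZE=250
# Total images written by the sampling run.
TOTAL=$((NUM_CLASSES * SAMPLES_PER_CLASS))
# Number of batches, rounded up.
NUM_BATCHES=$(( (TOTAL + BATCH_SIZE - 1) / BATCH_SIZE ))
echo "total=$TOTAL batches=$NUM_BATCHES"   # prints: total=50000 batches=200
```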

**Evaluation**

Download the reference batch from [here](https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/VIRTUAL_imagenet256_labeled.npz) provided by OpenAI.
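
The evaluation command in this section reads the reference batch from `ckpt/`. A minimal sketch for placing it there, assuming `wget` is installed; the helper name `fetch_ref_batch` is hypothetical and not part of the repository:

```bash
# URL of the ImageNet 256x256 reference batch provided by OpenAI.
REF_URL="https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/VIRTUAL_imagenet256_labeled.npz"
# Target path expected by the evaluation command.
REF_PATH="ckpt/$(basename "$REF_URL")"

# Hypothetical helper: download once, skip if the file already exists.
fetch_ref_batch() {
  mkdir -p "$(dirname "$REF_PATH")"
  [ -f "$REF_PATH" ] || wget -O "$REF_PATH" "$REF_URL"
}

echo "reference batch path: $REF_PATH"
```

Calling `fetch_ref_batch` afterwards starts the download.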

```bash
conda activate oai-eva
python metrics/evaluator.py ckpt/VIRTUAL_imagenet256_labeled.npz gen_results/d16.npz
```

## Acknowledgements

We thank the following open-source projects:

[VAR](https://github.com/FoundationVision/VAR)<br>
[Infinity](https://github.com/FoundationVision/Infinity)<br>
[FLUX](https://github.com/black-forest-labs/flux)<br>
[SEED-Voken](https://github.com/TencentARC/SEED-Voken)<br>
