
# MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning

## Installation

```bash
conda create -n milr python=3.10
conda activate milr
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt

# Install Geneval dependencies
# You may see package conflict warnings here; they can be ignored
pip install -U openmim
mim install mmengine mmcv-full==1.7.2 

cd src/geneval
./evaluation/download_models.sh "<OBJECT_DETECTOR_FOLDER>/"
git clone https://github.com/open-mmlab/mmdetection.git
cd mmdetection; git checkout 2.x
pip install -v -e .   

cd ../rewards
./evaluation/download_models.sh "<OBJECT_DETECTOR_FOLDER>/"
```

## Usage
We support several reward types. The main code for the **Geneval** benchmark is released here; the code for T2I-CompBench and WISE will be released after the review.

### Geneval

```bash
cd src
bash scripts/geneval_both.sh
```

The contents of the bash file:

```bash
#!/bin/bash

PATH_TO_DATA="prompts/geneval/evaluation_metadata.jsonl"
PATH_TO_MODEL="deepseek-ai/Janus-Pro-7B"
output_dir="./geneval_results/geneval_both" # create this directory yourself before running
optimize_mode="both"
reward_model_type="geneval"
text_k=0.2 
image_k=0.02 
lr=0.03
max_text_steps=20
max_image_steps=20
max_both_steps=20

# === set log file name ===
if [ "$optimize_mode" = "text" ]; then
    LOG_FILE="$output_dir/${optimize_mode}_tk${text_k}_lr${lr}_ts${max_text_steps}.txt"
elif [ "$optimize_mode" = "image" ]; then
    LOG_FILE="$output_dir/${optimize_mode}_ik${image_k}_lr${lr}_is${max_image_steps}.txt"
else
    LOG_FILE="$output_dir/${optimize_mode}_tk${text_k}_ik${image_k}_lr${lr}_bs${max_both_steps}.txt"
fi

# === starting script ===
CUDA_VISIBLE_DEVICES=0 python main_janus.py \
    --dataset "$PATH_TO_DATA" \
    --model_name_or_path "$PATH_TO_MODEL" \
    --output_dir "$output_dir" \
    --optimize_mode "$optimize_mode" \
    --reward_model_type "$reward_model_type" \
    --lr "$lr" \
    --text_k "$text_k" \
    --image_k "$image_k" \
    --max_text_steps "$max_text_steps" \
    --max_image_steps "$max_image_steps" \
    --max_both_steps "$max_both_steps" \
    --device "cuda" \
    > "$LOG_FILE" 2>&1 &
```
- `output_dir`: the folder where results are saved; you must create it before running the script
- `optimize_mode`: the optimization mode, one of `both`, `image`, or `text`
- `reward_model_type`: the reward model used for optimization; see `main_janus.py` for the available options
- `text_k`: the ratio of text tokens to optimize
- `image_k`: the ratio of image tokens to optimize
- `lr`: the learning rate
- `max_text_steps`: the number of optimization steps in `text` mode
- `max_image_steps`: the number of optimization steps in `image` mode
- `max_both_steps`: the number of optimization steps in `both` mode
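
Since the script redirects its output into `$output_dir`, the redirect fails if the directory is missing. The following is a minimal sketch, using the default values from the script above, that creates the directory and previews the log file name produced in `both` mode:

```bash
#!/bin/bash
# Values copied from the script above.
output_dir="./geneval_results/geneval_both"
optimize_mode="both"
text_k=0.2
image_k=0.02
lr=0.03
max_both_steps=20

# Create the output directory (and any missing parents).
mkdir -p "$output_dir"

# Same naming rule as the "both" branch of the script.
LOG_FILE="$output_dir/${optimize_mode}_tk${text_k}_ik${image_k}_lr${lr}_bs${max_both_steps}.txt"
echo "$LOG_FILE"
# prints ./geneval_results/geneval_both/both_tk0.2_ik0.02_lr0.03_bs20.txt
```

Because the main script launches Python in the background, you can follow progress with `tail -f "$LOG_FILE"`.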

### Analysis
You can vary `text_k` or `image_k` to study the influence of the token ratio, and vary the number of steps to observe how test-time scaling behaves as steps increase.
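
One way to run such a study is a small sweep over `text_k` and the step count. The sketch below is a hypothetical helper, not part of the repository: the swept values (`0.1 0.2 0.4` and `10 20`), the `sweep_text_*` directory names, and the `DRY_RUN` switch are illustrative assumptions, while the flags are the ones passed to `main_janus.py` in the script above. By default it only prints the commands it would run.

```bash
#!/bin/bash
# Hypothetical sweep helper; values and directory names are illustrative.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = actually run them

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

n_runs=0
for text_k in 0.1 0.2 0.4; do
    for steps in 10 20; do
        out="./geneval_results/sweep_text_tk${text_k}_ts${steps}"
        # Only create result directories when really running.
        if [ "$DRY_RUN" != "1" ]; then
            mkdir -p "$out"
        fi
        run python main_janus.py \
            --dataset "prompts/geneval/evaluation_metadata.jsonl" \
            --model_name_or_path "deepseek-ai/Janus-Pro-7B" \
            --output_dir "$out" \
            --optimize_mode "text" \
            --reward_model_type "geneval" \
            --lr 0.03 \
            --text_k "$text_k" \
            --image_k 0.02 \
            --max_text_steps "$steps" \
            --max_image_steps 20 \
            --max_both_steps 20 \
            --device "cuda"
        n_runs=$((n_runs + 1))
    done
done
echo "planned runs: $n_runs"
```

Run it from `src` with `DRY_RUN=0` to execute the sweep. Unlike the main script, this sketch runs each configuration sequentially in the foreground, so a single GPU is never shared between runs.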

### Different Rewards
For `SelfReward`, you can run the script:
```bash
bash scripts/self_reward.sh 
```
For `UnifiedReward`, you can run the script:
```bash
bash scripts/unified_reward_geneval.sh 
```
Remember to download the `CodeGoat24/UnifiedReward-qwen-7b` model first.

For `MixedReward`, you can run the following script:
```bash
cd rewards/MixedReward
git clone https://github.com/IDEA-Research/GroundingDINO.git
mkdir reward_weights
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
huggingface-cli download microsoft/git-large-vqav2 --repo-type model --local-dir git-large-vqav2
cd ../..
bash scripts/unified_reward_geneval.sh 
```

For the `GPT-4o` reward, first set the API key in `main_janus.py`, then run the following script:
```bash
bash scripts/geneval_gpt4o_reward.sh 
```
