# Transfer between Modalities with MetaQueries

## Installation
```bash
conda env create -f environment.yml
conda activate metaquery
```

## Training
To train the model on a single node, use the command below. Its arguments are:
- `run_name` is the name that appears in the checkpoint path and in wandb.
- `config_file` is the path to the YAML file that contains the training configs. The provided configs are [here](configs). If you prefer to specify the configs directly on the command line, you can omit the `--config_file` argument.
- `base_dir` is the path to the directory where data and checkpoints are saved.

```bash
OMP_NUM_THREADS=12 torchrun --nproc-per-node=8 train.py \
    --run_name test \
    --config_file llavaov0p5_sana.yaml \
    --base_dir /path/to/metaquery
```

> **Tip**: To speed up data downloading, you can first run the following command, which downloads the data in parallel (e.g., with 64 threads), and then switch to the regular training command above.
>
> ```bash
> OMP_NUM_THREADS=64 torchrun --nproc-per-node=1 train.py \
>     --run_name test \
>     --config_file llavaov0p5_sana.yaml \
>     --base_dir /path/to/metaquery
> ```

> **Note**: For text-to-image pretraining, we only provide the code for [cc12m](https://huggingface.co/datasets/pixparse/cc12m-wds), since it can be loaded directly with the [datasets](https://github.com/huggingface/datasets) package. Training on this dataset alone is not guaranteed to reproduce the performance reported in the paper.

To train the model on multiple nodes, see the sample SLURM script [here](run_slurm.sh) for reference.

For editing and instruction tuning, you may also need to specify the `--resume_from_checkpoint` argument to resume from the previous checkpoint.

## Demo
Once you have a checkpoint ready, run the following command to start the demo:
```bash
python app.py --checkpoint_path /path/to/checkpoint
```

## Evaluation
For evaluation, please follow the instructions [here](eval/EVALUATION.md).

## MetaQuery Instruction Tuning Data (2.4M)

In this work, we collect an instruction-tuning dataset, MetaQuery-Instruct-2.4M. We group images from web corpora based on caption similarity, then use an MLLM to construct instruction-tuning data from the resulting image pairs.
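
To make the grouping step concrete, here is a toy sketch of grouping images by caption similarity and enumerating candidate pairs within each group. It is not the curation code in this repository: the bag-of-words cosine similarity, the greedy assignment, the `0.5` threshold, and the names `bag_of_words`, `cosine`, and `group_by_caption` are all assumptions chosen for illustration, and the MLLM step that writes the instructions is left out.

```python
import itertools
import math
import re
from collections import Counter


def bag_of_words(caption: str) -> Counter:
    """Lowercase the caption and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", caption.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_by_caption(captions: list[str], threshold: float = 0.5) -> list[list[int]]:
    """Greedily put each caption into the first group whose seed caption is similar enough.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    vectors = [bag_of_words(c) for c in captions]
    groups: list[list[int]] = []
    for i, vec in enumerate(vectors):
        for group in groups:
            # Compare against the group's first member (its seed).
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            # No existing group matched, so this caption seeds a new one.
            groups.append([i])
    return groups


captions = [
    "a red vintage car parked on a street",
    "a red vintage car driving on a street",
    "a bowl of ramen with egg",
    "a vintage red car on a city street",
]
groups = group_by_caption(captions)

# Candidate image pairs: every pair of indices that share a group.
pairs = [p for g in groups for p in itertools.combinations(g, 2)]
print(groups)
print(pairs)
```

In practice a learned text embedding would likely replace the bag-of-words vectors; the sketch only shows the shape of the grouping and pairing steps.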

We provide the dataset curation code [here](curate_dataset.py) for reference. The dataset is curated from [mmc4](https://huggingface.co/datasets/mmc4).

After tuning on the MetaQuery-Instruct-2.4M dataset, the model achieves strong zero-shot subject-driven generation performance (the first row) and, surprisingly, unlocks novel capabilities such as visual association and logo design that go beyond copy-pasting (the second row).
