# Evaluation Data Workflow

1. Download [CapsFusion-120M](https://huggingface.co/datasets/BAAI/CapsFusion-120M) and [BLIP-CapFilter](https://github.com/salesforce/BLIP#pre-training-datasets-download).
2. `extract_noun_phrases.py`: extract noun phrases from the text.
3. `filter_and_sample.py`: filter and sample the image-caption pairs following [LLaVA](https://arxiv.org/abs/2304.08485) Appendix E.
4. `download_images.py`, `download_images_lca.py`: download images via [img2dataset](https://github.com/rom1504/img2dataset).
5. Download [JourneyDB](https://huggingface.co/datasets/JourneyDB/JourneyDB) (see `JourneyDB.py`) and [laion-coco-aesthetic](https://huggingface.co/datasets/guangyil/laion-coco-aesthetic).
6. `tokenize_images.py`: tokenize images via LaVIT.
7. `update_JourneyDB.py`: clean JourneyDB annotations.
8. `extract_caption.py`: extract captions for images that come from BLIP-CapFilter (without a CapsFusion caption) and laion-coco-aesthetic.
9. `capsfusion_vllm_multi_gpu.py`: generate CapsFusion captions for the BLIP-CapFilter and laion-coco-aesthetic images.
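
The filtering in step 3 can be illustrated with a minimal sketch. This is not the code in `filter_and_sample.py`; it is one reading of the noun-phrase balancing described in the LLaVA paper: drop noun phrases that occur fewer than 3 times, then walk the remaining phrases from least to most frequent, keeping every caption that contains the phrase but capping each phrase at 100 randomly chosen captions. The function name, its parameters, and the input layout (one collection of noun phrases per caption, e.g. as produced in step 2) are assumptions made for illustration.

```python
import random
from collections import defaultdict


def filter_and_sample(phrases_per_caption, min_freq=3, max_per_phrase=100, seed=0):
    """Return sorted indices of captions kept by noun-phrase frequency balancing.

    phrases_per_caption: list whose i-th item holds the noun phrases of caption i.
    min_freq, max_per_phrase: thresholds assumed from the LLaVA description.
    """
    rng = random.Random(seed)

    # Map each noun phrase to the indices of the captions that contain it.
    phrase_to_captions = defaultdict(list)
    for idx, phrases in enumerate(phrases_per_caption):
        for phrase in set(phrases):
            phrase_to_captions[phrase].append(idx)

    # Skip noun phrases that are too rare.
    kept = {p: ids for p, ids in phrase_to_captions.items() if len(ids) >= min_freq}

    selected = set()
    # Visit phrases from lowest to highest frequency (ties broken alphabetically).
    for phrase in sorted(kept, key=lambda p: (len(kept[p]), p)):
        ids = kept[phrase]
        # Cap over-represented phrases with a random subset.
        if len(ids) > max_per_phrase:
            ids = rng.sample(ids, max_per_phrase)
        selected.update(ids)
    return sorted(selected)
```

Visiting rare phrases first means captions with uncommon concepts are always retained, while very common concepts contribute at most `max_per_phrase` captions each.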

## Commands

Tokenize images:
```shell
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/IC/tokenize_images.py --dataset_name Merged --dataset_shard_index 0

accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/IC/tokenize_images.py --dataset_name Merged_new --dataset_shard_index 6

accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/IC/tokenize_images.py --dataset_name JourneyDB

accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/IC/tokenize_images.py --dataset_name "laion-coco-aesthetic" --dataset_shard_index 0
```

Extract captions:
```shell
python data/IC/extract_caption.py --dataset_name Merged_new
python data/IC/extract_caption.py --dataset_name "laion-coco-aesthetic"
```

Tokenize text:
```shell
bash scripts/data/Merged_new_tokenize_texts_ic.sh
bash scripts/data/LCA_tokenize_texts_ic.sh
bash scripts/data/JourneyDB_tokenize_texts_ic.sh
```
