# Evaluation Data Workflow

1. `tokenize_images.py`: Tokenize the images (read from files or PIL objects) with LaVIT.
2. `tokenize_texts.py`: Tokenize the texts with LLaMA2 and construct the final data.


## Utils

1. `convert_tsv.py`: Convert VLMEvalKit TSV files to HF datasets and save the images to disk for image tokenization.
2. `convert_tsv_mmbench.py`: Convert VLMEvalKit TSV files (mmbench_series) to HF datasets and save the images to disk for image tokenization.
3. `build_datasets_from_hf.py`: Download and build datasets from HF datasets.

## Notes
1. `HallusionBench`: some samples have no images.
2. `MMMU`: some samples have multiple images.

## Commands

Prepare data:
```shell
python data/Evaluation/convert_tsv.py
python data/Evaluation/convert_tsv_mmbench.py
python data/Evaluation/build_datasets_from_hf.py --download_datasets
```

Tokenize images:
```shell
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name MME
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name POPE --dataset_type PIL
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name Winoground-YN --dataset_type PIL

accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name MMBench_DEV_EN
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name MMBench_TEST_EN
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name ScienceQA_VAL
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name ScienceQA_TEST
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name SEEDBench_IMG
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name HallusionBench
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name MMVet
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name MMMU_VAL_MultiChoice --dataset_type PIL
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name MathVista_MultiChoice --dataset_type PIL

accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name COCO --dataset_type PIL
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name Flickr30K --dataset_type PIL
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name NoCaps --dataset_type PIL
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name WHOOPS-Caption --dataset_type PIL
# Optional: reuse the WHOOPS-Caption image tokens for WHOOPS-VQA (uncomment to run).
# cp -r YOUR_ROOT_PATH/data/MLLM/Evaluation/WHOOPS-Caption/image_token YOUR_ROOT_PATH/data/MLLM/Evaluation/WHOOPS-VQA/image_token
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name VQAv2_VAL --dataset_type PIL
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name VQAv2_TEST --dataset_type PIL
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name OK-VQA --dataset_type PIL
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name VizWiz_VAL --dataset_type PIL
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name VizWiz_TEST --dataset_type PIL
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name TextVQA --dataset_type PIL
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name GQA_TESTDEV_BALANCED
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name MMMU_VAL_OpenEnded --dataset_type PIL
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --dataset_name MathVista_OpenEnded --dataset_type PIL

accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --output_dir "YOUR_ROOT_PATH/data/MLLM/Evaluation/Few-shot" --dataset_name COCO
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --output_dir "YOUR_ROOT_PATH/data/MLLM/Evaluation/Few-shot" --dataset_name Flickr30K
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --output_dir "YOUR_ROOT_PATH/data/MLLM/Evaluation/Few-shot" --dataset_name NoCaps
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --output_dir "YOUR_ROOT_PATH/data/MLLM/Evaluation/Few-shot" --dataset_name WHOOPS-Caption
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --output_dir "YOUR_ROOT_PATH/data/MLLM/Evaluation/Few-shot" --dataset_name VQAv2
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --output_dir "YOUR_ROOT_PATH/data/MLLM/Evaluation/Few-shot" --dataset_name OK-VQA
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --output_dir "YOUR_ROOT_PATH/data/MLLM/Evaluation/Few-shot" --dataset_name TextVQA
accelerate launch --main_process_port "39800" --num_processes "8" --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py --output_dir "YOUR_ROOT_PATH/data/MLLM/Evaluation/Few-shot" --dataset_name VizWiz
```
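The image-tokenization commands above differ only in `--dataset_name` and the optional `--dataset_type PIL` or `--output_dir` flags, so they can be driven from a loop. The following is a minimal sketch, not part of the repository: the helper names `run` and `tokenize_images` and the `DRY_RUN` variable are hypothetical, and only a few datasets from the list are shown. With `DRY_RUN=1` (the default in this sketch) the commands are printed rather than executed.

```shell
# Hypothetical helper, not part of the repository.
# DRY_RUN=1 (default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

tokenize_images() {
  # $1: dataset name; any remaining arguments are passed through unchanged
  # (for example --dataset_type PIL or --output_dir ...).
  name="$1"
  shift
  run accelerate launch --main_process_port "39800" --num_processes "8" \
    --config_file configs/bf16_inference.yaml data/Evaluation/tokenize_images.py \
    --dataset_name "$name" "$@"
}

# Datasets listed above without --dataset_type.
for name in MME MMBench_DEV_EN MMBench_TEST_EN; do
  tokenize_images "$name"
done

# Datasets listed above with --dataset_type PIL.
for name in POPE Winoground-YN COCO; do
  tokenize_images "$name" --dataset_type PIL
done
```

Set `DRY_RUN=0` to actually launch the jobs; extend the two `for` lists with the remaining dataset names as needed.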

Tokenize texts:
```shell
### Generate all for zero-shot, zero-shot cot, zero-shot cod, few-shot
# Caption
python data/Evaluation/tokenize_texts.py --from_hf --dataset_name COCO --prompt_settings 'zero_shot' --num_templates "1"
python data/Evaluation/tokenize_texts.py --from_hf --dataset_name Flickr30K --prompt_settings 'zero_shot' --num_templates "1"
python data/Evaluation/tokenize_texts.py --from_hf --dataset_name NoCaps --prompt_settings 'zero_shot' --num_templates "1"
python data/Evaluation/tokenize_texts.py --from_hf --dataset_name "WHOOPS-Caption" --prompt_settings 'zero_shot' --num_templates "1"

# VQA
python data/Evaluation/tokenize_texts.py --from_hf --dataset_name VQAv2_VAL --prompt_settings "zero_shot,zero_shot_cot,zero_shot_cod" --num_templates "8,8,8"
python data/Evaluation/tokenize_texts.py --from_hf --dataset_name VQAv2_TEST --prompt_settings "zero_shot,zero_shot_cot,zero_shot_cod" --num_templates "8,8,8"
python data/Evaluation/tokenize_texts.py --from_hf --dataset_name OK-VQA --prompt_settings "zero_shot,zero_shot_cot,zero_shot_cod" --num_templates "8,8,8"
python data/Evaluation/tokenize_texts.py --from_hf --dataset_name TextVQA --prompt_settings "zero_shot,zero_shot_cot,zero_shot_cod" --num_templates "8,8,8"
python data/Evaluation/tokenize_texts.py --from_hf --dataset_name VizWiz_VAL --prompt_settings "zero_shot,zero_shot_cot,zero_shot_cod" --num_templates "8,8,8"
python data/Evaluation/tokenize_texts.py --from_hf --dataset_name VizWiz_TEST --prompt_settings "zero_shot,zero_shot_cot,zero_shot_cod" --num_templates "8,8,8"
python data/Evaluation/tokenize_texts.py --dataset_name GQA_TESTDEV_BALANCED --prompt_settings "zero_shot,zero_shot_cot,zero_shot_cod" --num_templates "8,8,8"
python data/Evaluation/tokenize_texts.py --from_hf --dataset_name MMMU_VAL_OpenEnded --prompt_settings "zero_shot,zero_shot_cot,zero_shot_cod" --num_templates "8,8,8"
python data/Evaluation/tokenize_texts.py --from_hf --dataset_name MathVista_OpenEnded --prompt_settings "zero_shot,zero_shot_cot,zero_shot_cod" --num_templates "8,8,8"

# Y/N
python data/Evaluation/tokenize_texts.py --dataset_name MME --prompt_settings "zero_shot_choices_ppl,zero_shot_cot_choices_ppl,zero_shot_cod_choices_ppl" --num_templates "8,8,8"
python data/Evaluation/tokenize_texts.py --from_hf --dataset_name POPE --prompt_settings "zero_shot_choices_ppl,zero_shot_cot_choices_ppl,zero_shot_cod_choices_ppl" --num_templates "8,8,8"
python data/Evaluation/tokenize_texts.py --dataset_name HallusionBench --prompt_settings "zero_shot_choices_ppl,zero_shot_cot_choices_ppl,zero_shot_cod_choices_ppl" --num_templates "8,8,8"
python data/Evaluation/tokenize_texts.py --from_hf --dataset_name Winoground-YN --prompt_settings "zero_shot_choices_ppl,zero_shot_cot_choices_ppl,zero_shot_cod_choices_ppl" --num_templates "8,8,8"

# MC
python data/Evaluation/tokenize_texts.py --dataset_name MMBench_DEV_EN --prompt_settings "zero_shot_choices_ppl,zero_shot_cot_choices_ppl,zero_shot_cod_choices_ppl" --num_templates "8,8,8"
python data/Evaluation/tokenize_texts.py --dataset_name ScienceQA_VAL --prompt_settings "zero_shot_choices_ppl,zero_shot_cot_choices_ppl,zero_shot_cod_choices_ppl" --num_templates "8,8,8"
python data/Evaluation/tokenize_texts.py --dataset_name ScienceQA_TEST --prompt_settings "zero_shot_choices_ppl,zero_shot_cot_choices_ppl,zero_shot_cod_choices_ppl" --num_templates "8,8,8"
python data/Evaluation/tokenize_texts.py --dataset_name SEEDBench_IMG --prompt_settings "zero_shot_choices_ppl,zero_shot_cot_choices_ppl,zero_shot_cod_choices_ppl" --num_templates "8,8,8"
python data/Evaluation/tokenize_texts.py --from_hf --dataset_name MMMU_VAL_MultiChoice --prompt_settings "zero_shot_choices_ppl,zero_shot_cot_choices_ppl,zero_shot_cod_choices_ppl" --num_templates "8,8,8"
python data/Evaluation/tokenize_texts.py --from_hf --dataset_name MathVista_MultiChoice --prompt_settings "zero_shot_choices_ppl,zero_shot_cot_choices_ppl,zero_shot_cod_choices_ppl" --num_templates "8,8,8"

###
# Text-only benchmark
python data/Evaluation/tokenize_texts.py --dataset_name MMLU --from_hf --prompt_settings "zero_shot_cot_choices_ppl" --num_templates "8"
python data/Evaluation/tokenize_texts.py --dataset_name PIQA_VAL --from_hf --prompt_settings "zero_shot_cot_choices_ppl" --num_templates "8"
```
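Within each task group the text-tokenization commands share the same `--prompt_settings` and `--num_templates` values, so a small wrapper can cut the repetition. This is a minimal sketch under stated assumptions: `run`, `tokenize_texts`, `DRY_RUN`, `VQA_SETTINGS`, and `PPL_SETTINGS` are hypothetical names introduced here, and only some of the datasets are listed. Whether a dataset takes `--from_hf` follows the commands above.

```shell
# Hypothetical helper, not part of the repository.
# DRY_RUN=1 (default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

tokenize_texts() {
  # $1: comma-separated prompt settings; $2: matching template counts;
  # remaining arguments (e.g. --from_hf --dataset_name X) are passed through.
  settings="$1"
  templates="$2"
  shift 2
  run python data/Evaluation/tokenize_texts.py "$@" \
    --prompt_settings "$settings" --num_templates "$templates"
}

VQA_SETTINGS="zero_shot,zero_shot_cot,zero_shot_cod"
PPL_SETTINGS="zero_shot_choices_ppl,zero_shot_cot_choices_ppl,zero_shot_cod_choices_ppl"

# VQA datasets that are loaded with --from_hf.
for name in VQAv2_VAL VQAv2_TEST OK-VQA TextVQA; do
  tokenize_texts "$VQA_SETTINGS" "8,8,8" --from_hf --dataset_name "$name"
done

# GQA is listed without --from_hf.
tokenize_texts "$VQA_SETTINGS" "8,8,8" --dataset_name GQA_TESTDEV_BALANCED

# Multiple-choice datasets listed without --from_hf.
for name in MMBench_DEV_EN ScienceQA_VAL ScienceQA_TEST SEEDBench_IMG; do
  tokenize_texts "$PPL_SETTINGS" "8,8,8" --dataset_name "$name"
done
```

Set `DRY_RUN=0` to execute the commands instead of printing them.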