# Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation

Official repository for *Kaleidoscope*, **a comprehensive multilingual multimodal exam benchmark** evaluating VLMs across:
- **18 languages** (Bengali → Spanish)
- **14 subjects** (STEM to Humanities) 
- **20,911 questions** (55% requiring image understanding)

<p align="center">
  <img src="images/overview.png" width="80%" alt="Kaleidoscope benchmark overview">
</p>

## 🚀 Quick Start
```python
from datasets import load_dataset
dataset = load_dataset("Anonym-sub/Kaleidoscope")
```
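To check what was downloaded, it can help to look at the available splits and columns. The sketch below is a minimal illustration, assuming `load_dataset` returns a `DatasetDict` here; `summarize_splits` and `load_and_summarize` are hypothetical helpers, not part of this repository, and no split or column names are assumed.

```python
def summarize_splits(dataset_dict):
    """Return {split_name: (num_rows, column_names)} for a DatasetDict-like mapping."""
    return {
        name: (len(split), list(split.column_names))
        for name, split in dataset_dict.items()
    }


def load_and_summarize(repo_id="Anonym-sub/Kaleidoscope"):
    """Download the dataset (needs network access) and print a per-split summary."""
    # Imported lazily so summarize_splits can be used without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(repo_id)
    for name, (num_rows, columns) in summarize_splits(dataset).items():
        print(f"{name}: {num_rows} rows, columns={columns}")
    return dataset
```

Calling `load_and_summarize()` prints one line per split, which is a quick way to confirm the field names before writing any evaluation code.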

## Environment Setup

We use Docker to ensure a consistent environment. To build and run the Docker container, use the following commands:

```sh
docker build --progress=plain -t eval .
docker run --rm -it --gpus all --shm-size=16g -v "$(pwd)":/eval -w /eval/ eval
```

If your Docker installation requires root privileges, prefix the commands with `sudo`:

```sh
sudo docker build --progress=plain -t eval .
sudo docker run --rm -it --gpus all --shm-size=16g -v "$(pwd)":/eval -w /eval/ eval
```


## 🛠️ Evaluation
### Running Inference 

Make sure the `data` folder is present in your working directory. Download the .zip file from [here](https://huggingface.co/datasets/Anonym-sub/Kaleidoscope/tree/main) and extract its contents into the `data` folder:

```sh
wget https://huggingface.co/datasets/Anonym-sub/Kaleidoscope/resolve/main/data.zip
unzip data.zip
mv final_data/* .
rm data.zip
```

Run the three scripts in order (inference, answer extraction, scoring):

```sh
# Inference
python main.py \
--model <model_name> \
--dataset <dataset_name_or_path> \
--model_path <model_path> \
--api_key <api-key-if-needed>

# Answer Extraction
python format_answer.py --results_path <prediction_json> --save_path <extraction_json>

# Final scores
python get_score.py <extraction_json>
```

The supported models are:

```python
SUPPORTED_MODELS = [
    # Local Models
    ## vLLM
    "qwen2.5-3b",
    "qwen2.5-7b",
    "qwen2.5-32b",
    "qwen2.5-72b",
    "molmo",
    ## Hugging Face
    "pangea",
    "qwen2-7b",

    # API Models
    "gpt-4o",
    "gemini-1.5-pro",
    "gemini-1.5-flash",
    "claude-3-5-sonnet-latest",
    "aya-vision",
]
```

For local models, you can specify the Hugging Face model path using the `--model_path` argument. For API models, `--model_path` indicates the model identifier, and you also need to provide the API key using the `--api_key` argument.
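The argument rules above can be sketched with `argparse`. This is a hypothetical illustration, not the actual contents of `main.py`: the `LOCAL_MODELS` and `API_MODELS` lists are an illustrative subset of the supported models, and treating `--model_path` as optional and `--api_key` as mandatory for API models is an assumption based on the description.

```python
import argparse

# Illustrative subset only; see SUPPORTED_MODELS for the full list.
LOCAL_MODELS = ["qwen2.5-3b", "molmo", "pangea"]
API_MODELS = ["gpt-4o", "aya-vision"]


def build_parser():
    """Build a parser with the four flags used by the inference command."""
    parser = argparse.ArgumentParser(description="Hypothetical sketch of the inference CLI")
    parser.add_argument("--model", required=True, choices=LOCAL_MODELS + API_MODELS)
    parser.add_argument("--dataset", required=True)
    # Hugging Face path for local models, model identifier for API models.
    parser.add_argument("--model_path", default=None)
    parser.add_argument("--api_key", default=None)
    return parser


def parse_args(argv=None):
    """Parse arguments and reject API models that come without an API key."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.model in API_MODELS and not args.api_key:
        parser.error(f"--api_key is required for API model '{args.model}'")
    return args
```

Using `choices` makes an unsupported model name fail at parse time rather than partway through a run.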

For example, to evaluate the `qwen2.5-3b` model on the `Kaleidoscope` dataset, run:

```sh
CUDA_VISIBLE_DEVICES=0 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
python main.py \
--model qwen2.5-3b \
--dataset Anonym-sub/Kaleidoscope \
--model_path Qwen/Qwen2.5-VL-3B-Instruct

python format_answer.py --results_path outputs/zero-shot/model_qwen2.5-3b/results.json --save_path outputs/zero-shot/model_qwen2.5-3b/results_format.json

python get_score.py ./outputs/zero-shot/model_qwen2.5-3b/results_format.json
```

To evaluate the `aya-vision` model on the `Kaleidoscope` dataset, run:

```sh
python main.py \
--model aya-vision \
--dataset Anonym-sub/Kaleidoscope \
--model_path c4ai-aya-vision-8b \
--api_key <your_api_key>

python format_answer.py --results_path outputs/zero-shot/model_aya-vision/results.json --save_path outputs/zero-shot/model_aya-vision/results_format.json

python get_score.py ./outputs/zero-shot/model_aya-vision/results_format.json
```
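
Both examples follow the same three steps, so they can be wrapped in a small driver. The sketch below is not part of the repository: `build_pipeline` and `run_pipeline` are hypothetical helpers, and the output layout `outputs/zero-shot/model_<model>/results.json` is an assumption inferred from the two examples above. Environment variables such as `CUDA_VISIBLE_DEVICES` are not set here and would still need to be exported beforehand.

```python
import subprocess
import sys
from pathlib import Path


def build_pipeline(model, dataset, model_path, api_key=None, output_root="outputs/zero-shot"):
    """Return the inference, extraction, and scoring commands as argument lists."""
    # Assumed layout, inferred from the README examples.
    run_dir = Path(output_root) / f"model_{model}"
    results = (run_dir / "results.json").as_posix()
    formatted = (run_dir / "results_format.json").as_posix()

    inference = [
        sys.executable, "main.py",
        "--model", model,
        "--dataset", dataset,
        "--model_path", model_path,
    ]
    if api_key:
        inference += ["--api_key", api_key]
    extraction = [
        sys.executable, "format_answer.py",
        "--results_path", results,
        "--save_path", formatted,
    ]
    scoring = [sys.executable, "get_score.py", formatted]
    return [inference, extraction, scoring]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For instance, `run_pipeline(build_pipeline("qwen2.5-3b", "Anonym-sub/Kaleidoscope", "Qwen/Qwen2.5-VL-3B-Instruct"))` would issue the same three commands as the first example, from the repository root.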