# MMMU Inference and Evaluation Script

This script provides tools to run inference using a Qwen2.5-VL model on the MMMU dataset and evaluate the generated results. It supports both standard inference and Chain-of-Thought (CoT) prompting. Evaluation can be performed using different judge models via DashScope or a custom MIT API endpoint.

## Features

*   **Inference:** Runs inference on the MMMU dataset.
*   **Chain-of-Thought (CoT):** Supports optional CoT prompting to potentially improve reasoning.
*   **Evaluation:** Evaluates model predictions against ground truth using external LLMs (GPT-3.5 Turbo, GPT-4) as judges.
*   **API Support:** Integrates with DashScope and a custom MIT API for evaluation.
*   **Metrics:** Calculates overall accuracy and accuracy per data split.
*   **Parallel Processing:** Uses threading for faster evaluation.

## Prerequisites

*   See `requirements.txt` for specific library dependencies.
*   Qwen2.5-VL Model Checkpoint: Path to the model weights, typically downloaded from Hugging Face.
*   (Optional) API Keys/Endpoints for Evaluation:
    *   For DashScope: `CHATGPT_DASHSCOPE_API_KEY` and `DASHSCOPE_API_BASE` environment variables.
    *   For MIT API: `MIT_SPIDER_TOKEN` and `MIT_SPIDER_URL` environment variables.

## Setup

1.  **Clone the Repository:**
    ```bash
    # git clone <repository_url>
    # cd <repository_directory>
    ```

2.  **Install Dependencies:**
    ```bash
    pip install -r requirements.txt
    ```

3.  **Prepare Model:**
    *   Download the Qwen2.5-VL model checkpoint from Hugging Face and note its local directory; this path is passed to `--model-path`.

4.  **Set Environment Variables (for Evaluation):**
    *   **DashScope:**
        ```bash
        export CHATGPT_DASHSCOPE_API_KEY="your-dashscope-api-key"
        export DASHSCOPE_API_BASE="your-dashscope-api-base"
        ```
    *   **MIT API:**
        ```bash
        export MIT_SPIDER_TOKEN="your-mit-spider-token"
        export MIT_SPIDER_URL="your-mit-spider-url"
        ```
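
Before starting a long evaluation run, it can help to confirm that the variables for the chosen API are actually set. The following is a minimal sketch, not part of the script itself: the variable names come from the setup steps above, while the helper `missing_env_vars` is a hypothetical name introduced here for illustration.

```python
import os

# Variable names are taken from the setup steps above. The keys mirror the
# values accepted by --api-type ("dash" and "mit").
REQUIRED_ENV = {
    "dash": ("CHATGPT_DASHSCOPE_API_KEY", "DASHSCOPE_API_BASE"),
    "mit": ("MIT_SPIDER_TOKEN", "MIT_SPIDER_URL"),
}


def missing_env_vars(api_type, environ=None):
    """Return the required variables that are unset or empty for api_type."""
    env = os.environ if environ is None else environ
    if api_type not in REQUIRED_ENV:
        raise ValueError(f"unknown api type: {api_type!r}")
    return [name for name in REQUIRED_ENV[api_type] if not env.get(name)]


if __name__ == "__main__":
    for api_type in REQUIRED_ENV:
        missing = missing_env_vars(api_type)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{api_type}: {status}")
```

Only the variables for the API you intend to use need to be present; the other group can be reported as missing without affecting the run.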

## Usage

The script operates in two modes: `infer` (for running inference) and `eval` (for evaluating results).

### 1. Inference Mode (`infer`)

This mode runs the Qwen2.5-VL model on the specified MMMU dataset split and saves the predictions.

**Command Structure:**

```bash
python run_mmmu.py infer \
    --model-path <Path_to_your_QwenVL_model> \
    --data-dir <Path_to_save_mmmu_data> \
    --dataset <Dataset_Split_Name> \
    --output-file <Path_to_save_inference_results.jsonl> \
    [--use-cot]
```

**Arguments:**

*   `--model-path`: (Required) Path to the Qwen2.5-VL model checkpoint directory.
*   `--data-dir`: (Required) Path to the directory where the MMMU dataset file is saved.
*   `--dataset`: (Optional) Name of the dataset split to use (default: `MMMU_DEV_VAL`).
*   `--output-file`: (Required) Path where the inference results (in JSON Lines format) will be saved.
*   `--use-cot`: (Optional) Flag to enable Chain-of-Thought prompting. If included, the default or custom CoT prompt will be appended to the input.

**Examples:**

*   **Standard Inference:**
    ```bash
    python run_mmmu.py infer \
        --model-path /path/to/Qwen2.5-VL-chat \
        --data-dir /data/mmmu \
        --dataset MMMU_DEV_VAL \
        --output-file results/mmmu_dev_val_predictions.jsonl
    ```

*   **Inference with Chain-of-Thought:**
    ```bash
    python run_mmmu.py infer \
        --model-path /path/to/Qwen2.5-VL-chat \
        --data-dir /data/mmmu \
        --dataset MMMU_DEV_VAL \
        --output-file results/mmmu_dev_val_predictions_cot.jsonl \
        --use-cot
    ```
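
Once inference finishes, the predictions file can be inspected directly. The sketch below assumes each line carries the `question_id` and `result` keys listed under Output Files, with the generated text stored under `result["gen"]`; the function name `load_predictions` is hypothetical and not part of the script.

```python
import json


def load_predictions(path):
    """Map question_id -> generated text from an inference results file."""
    predictions = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # Assumed layout: {"question_id": ..., "result": {"gen": ...}, ...}
            predictions[record["question_id"]] = record["result"]["gen"]
    return predictions


if __name__ == "__main__":
    preds = load_predictions("results/mmmu_dev_val_predictions.jsonl")
    print(f"loaded {len(preds)} predictions")
```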

### 2. Evaluation Mode (`eval`)

This mode takes the inference results (`.jsonl` file) and evaluates them against the ground truth using a specified judge model and API.

**Command Structure:**

```bash
python run_mmmu.py eval \
    --data-dir <Path_to_save_mmmu_data> \
    --input-file <Path_to_inference_results.jsonl> \
    --output-file <Path_to_save_evaluation_results.csv> \
    --dataset <Dataset_Split_Name> \
    --eval-model <Judge_Model_Name> \
    --api-type <API_Type> \
    [--nproc <Num_Processes>]
```

**Arguments:**

*   `--data-dir`: (Required) Path to the directory containing the MMMU dataset file. Used to load ground truth answers.
*   `--input-file`: (Required) Path to the inference results file (`.jsonl`) generated by the `infer` mode.
*   `--output-file`: (Required) Path where the detailed evaluation results (in CSV format) will be saved.
*   `--dataset`: (Optional) Name of the dataset split used during inference (default: `MMMU_DEV_VAL`). Must match the dataset used to generate the `--input-file`.
*   `--eval-model`: (Optional) The judge model to use for evaluation (default: `gpt-3.5-turbo-0125`). Choices depend on the API type. Examples: `gpt-3.5-turbo-0125`, `gpt-4-0125-preview`.
*   `--api-type`: (Optional) The API service to use for the judge model (default: `dash`). Choices: `dash` (DashScope), `mit` (Custom MIT API).
*   `--nproc`: (Optional) Number of parallel workers to use for evaluation (default: 4).
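
The script's internals are not reproduced here, but thread-based parallel judging is commonly structured along the following lines. This is a sketch under stated assumptions: `--nproc` is taken to set the number of worker threads, and `judge_one` is a hypothetical stand-in that does a plain string comparison where the real script would call the judge model.

```python
from concurrent.futures import ThreadPoolExecutor


def judge_one(sample):
    """Hypothetical stand-in for a judge-model API call.

    It only does a case-insensitive exact match here; per this README,
    the script itself uses an external LLM as the judge.
    """
    hit = int(sample["prediction"].strip().upper() == sample["GT"].strip().upper())
    return {**sample, "hit": hit}


def judge_all(samples, nproc=4):
    """Judge samples concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=nproc) as pool:
        return list(pool.map(judge_one, samples))


if __name__ == "__main__":
    demo = [
        {"index": 0, "prediction": "A", "GT": "A"},
        {"index": 1, "prediction": "b", "GT": "C"},
    ]
    for row in judge_all(demo, nproc=2):
        print(row["index"], row["hit"])
```

Threads suit this workload because each judge request spends most of its time waiting on the network, so a higher worker count mainly increases the number of requests in flight. API rate limits are the usual practical ceiling.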

**Examples:**

*   **Evaluation using DashScope (GPT-3.5 Turbo):**
    *(Ensure `CHATGPT_DASHSCOPE_API_KEY` and `DASHSCOPE_API_BASE` are set)*
    ```bash
    python run_mmmu.py eval \
        --data-dir /data/mmmu \
        --input-file results/mmmu_dev_val_predictions.jsonl \
        --output-file results/mmmu_dev_val_evaluation.csv \
        --dataset MMMU_DEV_VAL \
        --eval-model gpt-3.5-turbo-0125 \
        --api-type dash \
        --nproc 8
    ```

*   **Evaluation using MIT API (GPT-4):**
    *(Ensure `MIT_SPIDER_TOKEN` and `MIT_SPIDER_URL` are set)*
    ```bash
    python run_mmmu.py eval \
        --data-dir /data/mmmu \
        --input-file results/mmmu_dev_val_predictions_cot.jsonl \
        --output-file results/mmmu_dev_val_evaluation_cot_gpt4.csv \
        --dataset MMMU_DEV_VAL \
        --eval-model gpt-4-0125-preview \
        --api-type mit \
        --nproc 4
    ```
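
When the two modes are run back to back, a small driver can keep the shared arguments consistent, in particular `--dataset` and the path that is `--output-file` for `infer` and `--input-file` for `eval`. The sketch below uses only the flags documented above; `build_commands` and `run_pipeline` are hypothetical helper names, and the defaults simply mirror the documented ones.

```python
import subprocess
import sys


def build_commands(model_path, data_dir, predictions, evaluation,
                   dataset="MMMU_DEV_VAL", use_cot=False,
                   eval_model="gpt-3.5-turbo-0125", api_type="dash", nproc=4):
    """Return the argv lists for the infer and eval invocations."""
    infer = [sys.executable, "run_mmmu.py", "infer",
             "--model-path", model_path,
             "--data-dir", data_dir,
             "--dataset", dataset,
             "--output-file", predictions]
    if use_cot:
        infer.append("--use-cot")
    evaluate = [sys.executable, "run_mmmu.py", "eval",
                "--data-dir", data_dir,
                "--input-file", predictions,
                "--output-file", evaluation,
                "--dataset", dataset,
                "--eval-model", eval_model,
                "--api-type", api_type,
                "--nproc", str(nproc)]
    return infer, evaluate


def run_pipeline(**kwargs):
    """Run inference, then evaluation; stop at the first failure."""
    for cmd in build_commands(**kwargs):
        subprocess.run(cmd, check=True)
```

Building the argument lists separately from running them makes it easy to print and review both commands before committing to a long inference job.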

## Output Files

*   **Inference (`--output-file` in `infer` mode, e.g., `results.jsonl`):**
    A JSON Lines file where each line corresponds to one sample from the dataset. Each line is a JSON object containing keys like `question_id`, `annotation` (original data including question, options, answer, image paths), `task`, `result` (with the model's generated response under `gen`), and `messages` (the prompt sent to the model).

*   **Evaluation (`--output-file` in `eval` mode, e.g., `evaluation_results.csv`):**
    A CSV file containing detailed results for each evaluated sample. Columns typically include `index`, `question`, `prediction` (model's parsed answer), `GT` (ground truth answer), `judge_output`, `hit` (1 if correct, 0 if incorrect), `split`, etc.

*   **Accuracy Summary (`<eval_output_file_prefix>_acc.json`):**
    A JSON file saved alongside the evaluation CSV, containing the `overall_accuracy` and a dictionary `accuracy_by_split` showing accuracy for different data splits within MMMU.
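
As a cross-check on the accuracy summary, the same figures can be recomputed from the evaluation CSV. This is a minimal sketch assuming the CSV has the `hit` and `split` columns described above; whether the script aggregates in exactly this way is an assumption, and `accuracy_from_csv` is a hypothetical helper name.

```python
import csv
from collections import defaultdict


def accuracy_from_csv(path):
    """Recompute overall and per-split accuracy from the evaluation CSV."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            split = row.get("split") or "unknown"
            hit = int(float(row["hit"]))  # accept "1" as well as "1.0"
            hits[split] += hit
            totals[split] += 1
    n = sum(totals.values())
    overall = sum(hits.values()) / n if n else 0.0
    by_split = {s: hits[s] / totals[s] for s in totals}
    return {"overall_accuracy": overall, "accuracy_by_split": by_split}


if __name__ == "__main__":
    print(accuracy_from_csv("results/mmmu_dev_val_evaluation.csv"))
```

If the recomputed values differ from the saved summary, the CSV is the more detailed record to inspect, since it keeps the judge output for every sample.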
