# SpatialViz-Bench

This repository provides scripts for evaluating multimodal large language models (MLLMs) on SpatialViz-Bench.

## Table of Contents

* [Installation](#installation)
* [Configuration for Closed-Source Models](#configuration-for-closed-source-models)
* [Running Evaluations](#running-evaluations)

## Installation

1. **Clone the Repository (if applicable):**
   ```bash
   git clone https://github.com/Anonymous285714/SpatialViz-Bench.git
   cd SpatialViz-Bench
   ```

2. **Create a Virtual Environment and Install Dependencies:**

   For open-source models, set up the environment according to the requirements listed in each model's own code repository. For closed-source models, only the following packages are required:

   ```txt
   openai
   datasets
   tqdm
   ```

## Configuration for Closed-Source Models

Before running `evaluate.py`, obtain an API key for each provider whose models you plan to evaluate. The script accepts these keys as command-line arguments.

* **Qwen API Key:** For accessing Qwen series models.
* **Doubao API Key:** For accessing Doubao series models.
* **OpenAI API Key:** For accessing OpenAI models (e.g., GPT-4o).
* **Gemini API Key:** For accessing Gemini series models.
* **OpenRouter API Key:** For accessing various models via OpenRouter.

Please ensure you have valid API keys for the models you intend to use.
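As a rough illustration of passing keys on the command line, the sketch below builds a small `argparse` parser. It is not the actual parser in `evaluate.py`: the `--qwen_key` and `--openai_key` flag names come from the example command in this README, while `--doubao_key`, `--gemini_key`, `--openrouter_key`, and the helper names `build_key_parser` and `provided_keys` are assumptions made for illustration.

```python
import argparse


def build_key_parser() -> argparse.ArgumentParser:
    """Build a parser with one optional flag per API provider (illustrative only)."""
    parser = argparse.ArgumentParser(description="API key flags (sketch)")
    # Flag names that appear in this README's example command.
    parser.add_argument("--qwen_key", default=None, help="Qwen API key")
    parser.add_argument("--openai_key", default=None, help="OpenAI API key")
    # Hypothetical flag names, assumed by analogy with the two above.
    parser.add_argument("--doubao_key", default=None, help="Doubao API key")
    parser.add_argument("--gemini_key", default=None, help="Gemini API key")
    parser.add_argument("--openrouter_key", default=None, help="OpenRouter API key")
    return parser


def provided_keys(args: argparse.Namespace) -> dict:
    """Return {provider: key} for every key flag that was actually supplied."""
    return {
        name[: -len("_key")]: value
        for name, value in vars(args).items()
        if name.endswith("_key") and value
    }


# Parse an explicit argument list instead of sys.argv so the example is reproducible.
args = build_key_parser().parse_args(["--openai_key", "sk-example"])
print(provided_keys(args))  # {'openai': 'sk-example'}
```

Keeping every key optional lets a single invocation supply only the keys needed for the models being evaluated.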

## Running Evaluations

Use the `evaluate.py` script to evaluate closed-source models.

The basic command structure is as follows:

```bash
python evaluate.py \
    --model_list "qwen2.5-vl-3b-instruct" "gpt-4o" \
    --benchmark_test_path "path/to/your/SpatialVizBench/SpatialViz_Bench_images" \
    --results_dir "path/to/your/results_directory" \
    --qwen_key "YOUR_QWEN_API_KEY" \
    --openai_key "YOUR_OPENAI_API_KEY" \
    # ... other API keys and arguments
```

To evaluate a specific open-source model, use the corresponding `evaluate_xxxvl.py` script, where `xxx` is a placeholder for the model name.

The basic command structure is as follows:

```bash
python evaluate_xxxvl.py \
    --model_paths "path/to/download/xxx/models" \
    --benchmark_test_path "path/to/your/SpatialVizBench/SpatialViz_Bench_images" \
    --results_dir "path/to/your/results_directory"
```

### Extract Answer from Results

The `get_answer` function in `evaluate.py` processes a results file (in JSONL format) produced by model inference. It does the following:

1.  **Extracting Answers:** It parses the model's output to identify the predicted answer (A, B, C, or D) for each question. It can handle outputs with and without explicit `<answer>` tags, attempting to find the answer even in less structured responses.
2.  **Calculating Accuracy:**
    * It compares the predicted answer with the ground truth answer.
    * It calculates and stores accuracy at different granularities:
        * `overall`: Accuracy across all test instances.
        * `category`: Accuracy for each main category in the benchmark.
        * `task`: Accuracy for each specific task type.
        * `level`: Accuracy for each `category-task-level` combination.
3.  **Recording Samples:**
    * It separates the evaluated instances into `positives` (correctly answered) and `negatives` (incorrectly answered).
    * Each sample in these lists includes the `DataID`, `InputText`, `Answer` (ground truth), and either the model's `Response` or its `ThinkingProcess` and `FinalAnswer`.
4.  **Saving Results:**
    * **Counting File:** It saves the accuracy statistics (number of correct predictions, total number of predictions, and accuracy percentage) for overall, category, task, and level into a JSON file (e.g., `results_MODELNAME_counting.json`) in the specified `counting` subdirectory.
    * **Samples File:** It saves the lists of positive and negative samples into a separate JSON file (e.g., `results_MODELNAME_samples.json`) in the specified `samples` subdirectory.
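
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual `get_answer` implementation: the regex heuristics, the function names `extract_choice` and `score`, and the record fields `Category`, `Task`, and `Level` are assumptions, while `DataID`, `InputText`, `Answer`, and `Response` are the fields named above. The sketch keeps everything in memory and does not write the counting or samples files.

```python
import re
from collections import defaultdict

# Matches the content of <answer>...</answer> tags, case-insensitively.
ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)
# Matches a standalone uppercase option letter.
OPTION = re.compile(r"\b([A-D])\b")


def extract_choice(response: str):
    """Return the predicted option letter (A-D), or None if none is found."""
    tagged = ANSWER_TAG.findall(response)
    # Prefer the last tagged answer; fall back to the whole response.
    candidates = ([tagged[-1]] if tagged else []) + [response]
    for text in candidates:
        letters = OPTION.findall(text)
        if letters:
            # Heuristic: the last standalone letter is treated as the answer.
            return letters[-1]
    return None


def score(records):
    """Compute accuracy per granularity and split records into positives/negatives."""
    counts = defaultdict(lambda: {"correct": 0, "total": 0})
    positives, negatives = [], []
    for rec in records:
        hit = extract_choice(rec["Response"]) == rec["Answer"]
        # 'Category', 'Task', and 'Level' are assumed field names.
        keys = [
            "overall",
            f"category/{rec['Category']}",
            f"task/{rec['Task']}",
            f"level/{rec['Category']}-{rec['Task']}-{rec['Level']}",
        ]
        for key in keys:
            counts[key]["total"] += 1
            counts[key]["correct"] += int(hit)
        sample = {k: rec[k] for k in ("DataID", "InputText", "Answer", "Response")}
        (positives if hit else negatives).append(sample)
    for stats in counts.values():
        stats["accuracy"] = 100.0 * stats["correct"] / stats["total"]
    return dict(counts), positives, negatives


# Two made-up records, used only to exercise the sketch.
records = [
    {"DataID": 1, "InputText": "q1", "Answer": "B", "Response": "<answer>B</answer>",
     "Category": "c1", "Task": "t1", "Level": "0"},
    {"DataID": 2, "InputText": "q2", "Answer": "A", "Response": "I pick D",
     "Category": "c1", "Task": "t1", "Level": "1"},
]
counts, positives, negatives = score(records)
print(counts["overall"])  # {'correct': 1, 'total': 2, 'accuracy': 50.0}
```

The untagged fallback is deliberately simple and can misfire, for example on a response that ends with the article "A", so a production extractor would likely need stricter patterns.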
