# CerebraGloss-Bench

## Structure

```
.
├── benchmark.json
├── img
│   ├── 102_av.jpg                  # 90 jpg files
│   ├── ...
│   └── 97_av.jpg
├── npy
│   ├── 102_av.npy
│   ├── ...                         # 90 npy files
│   └── annotation.json             # gt bboxes
├── README.md
├── results                         # model output
│   ├── cerebragloss_results.jsonl
│   ├── gemini2.5pro_results.jsonl
│   └── gpt5_results.jsonl
└── scripts
    ├── build_img_benchmark.py      # npy to jpg
    ├── eval.py                     # eval description, MCQ and QA
    ├── output_gen.py               # prompt gpt and gemini
    └── vis.py                      # vis benchmark and model output
```

## Environment Setup

```bash
pip install streamlit openai evaluate rouge-score scikit-learn
```

## Visualization

Run the visualization script with the following command:

```bash
streamlit run ./scripts/vis.py
```

You can select the sample ID from the sidebar to view the corresponding EEG data. Each sample includes a Summary, a multiple-choice Selection, and a Conversation generated by GPT-5.

We also provide `scripts/build_img_benchmark.py`, which builds JPG images from the EEG npy files. The pre-built images are already included in the `img` folder.

## Evaluation

We provide an evaluation script to assess the performance of your model or reproduce our results.

- First, prepare the evaluation file.

  - We provide a script that uses the OpenRouter API to generate responses for all three types of questions (Summary, Selection, Conversation). To use a different model, change the settings in `./scripts/output_gen.py`.

  ```bash
  python ./scripts/output_gen.py
  ```

  - You can also use your own model. *Make sure the generated responses are saved in `./results/` in JSONL format, where each line is a JSON object with the following structure:*

    ```json
    {
      "image": "91_av.npy",
      "output": {
        "summary": "Your generated summary here.",
        "selection": "Your generated selection here.",
        "conversation": "Your generated conversation here."
      }
    }
    ```

  - You can also visualize the generated results with the visualization script mentioned above (load the results file from the sidebar).

  For reference, we provide the results of our model (CerebraGloss-3B) in `./results/cerebragloss_results.jsonl`.
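
If you generate responses with your own model, the results file described above could be written along these lines. This is a minimal sketch, not part of the provided scripts: `write_results` is a hypothetical helper, and the output file name in the usage note is a placeholder.

```python
import json
from pathlib import Path


def write_results(records, path):
    """Write one JSON object per line, in the structure shown above.

    `records` is an iterable of (image_name, summary, selection, conversation)
    tuples. This helper is illustrative and not part of scripts/.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for image_name, summary, selection, conversation in records:
            row = {
                "image": image_name,
                "output": {
                    "summary": summary,
                    "selection": selection,
                    "conversation": conversation,
                },
            }
            # ensure_ascii=False keeps any non-ASCII model text readable.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```

For example, `write_results([("91_av.npy", summary, selection, conversation)], "results/my_model_results.jsonl")` would produce a one-line file, where the three text arguments are your model's responses.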

- Then, run `./scripts/eval.py` to evaluate the generated results.

  **1. Configuration (Required for 'conversation' task)**

  The script uses an OpenAI-compatible API for the `conversation` task. You must set your API key as an environment variable named `OPENAI_API_KEY`. *By default*, the script uses the OpenRouter API endpoint and the GPT-5 model for evaluation.

  **On Windows (PowerShell):**

  ```powershell
  $env:OPENAI_API_KEY="your_api_key_here"
  ```

  **On Linux/macOS:**

  ```bash
  export OPENAI_API_KEY="your_api_key_here"
  ```

  Optionally, you can modify the `BASE_URL` and `MODEL_NAME` constants at the top of `eval.py` to change the API endpoint or the evaluation model.

  **2. Running the Evaluation**

  The script evaluates model performance using the following metrics:
  - **Summary**: ROUGE score.
  - **Selection**: Accuracy.
  - **Conversation**: GPT-based score (1-10).

  Use the following command structure to run the evaluation:

  ```bash
  python scripts/eval.py --results_file <path_to_results> [options]
  ```

  **Command-Line Arguments:**

  |Argument|Description|Required|Default|
  |-|-|-|-|
  |`--results_file`|Path to the model's output file (e.g., `results/gpt5_results.jsonl`)|**Yes**|N/A|
  |`--tasks`|A space-separated list of tasks to evaluate. Choose from `summary`, `selection`, `conversation`.|No|`summary` `selection` `conversation`|
  |`--save`|Path to save the evaluation results as a JSON file. If not provided, results are only printed to the console.|No|`None`|

  **Examples:**

  - **Evaluate all tasks for a model and save the results:**
  
    ```bash
    python scripts/eval.py --results_file results/cerebragloss_results.jsonl --save cerebragloss_eval.json
    ```

  - **Evaluate `selection` and `summary` (do not save):**

    ```bash
    python scripts/eval.py --results_file results/cerebragloss_results.jsonl --tasks selection summary
    ```
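
Before running the evaluation, it can help to check that a results file matches the JSONL structure documented above. The following is a minimal sketch under that assumption: `check_results_file` is a hypothetical helper, not part of `scripts/eval.py`, and the evaluation script may apply different or additional checks.

```python
import json

TASKS = ("summary", "selection", "conversation")


def check_results_file(path, tasks=TASKS):
    """Return a list of human-readable problems found in a results JSONL file."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(row, dict) or "image" not in row:
                problems.append(f"line {lineno}: missing 'image'")
                continue
            output = row.get("output")
            if not isinstance(output, dict):
                problems.append(f"line {lineno}: 'output' is not an object")
                continue
            # Only require the fields for the tasks you plan to evaluate.
            for task in tasks:
                if task not in output:
                    problems.append(f"line {lineno}: missing output['{task}']")
    return problems
```

An empty list means every line has an `image` field and an `output` object with the requested task fields. Pass `tasks=("selection", "summary")` to mirror a run restricted with `--tasks selection summary`.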
