# GSM8K-V: A Multimodal Mathematical Reasoning Benchmark

GSM8K-V is a comprehensive multimodal benchmark designed to evaluate mathematical reasoning capabilities of vision-language models (VLMs) through visual representations of mathematical word problems. The benchmark extends the original GSM8K dataset by converting textual mathematical problems into structured visual scenes that require both visual understanding and mathematical reasoning.

## Table of Contents

- [Overview](#overview)
- [Benchmark Construction](#benchmark-construction)
  - [Architecture](#architecture)
  - [Step 1: Mathematical Information Decomposition and Allocation](#step-1-mathematical-information-decomposition-and-allocation)
  - [Step 2: Scene Description Generation](#step-2-scene-description-generation)
  - [Step 3: Image Generation and Human Verification](#step-3-image-generation-and-human-verification)
  - [Human Annotation](#human-annotation)
- [Benchmark Evaluation](#benchmark-evaluation)
  - [Supported Models](#supported-models)
  - [Evaluation Modes](#evaluation-modes)
  - [Data Categories](#data-categories)
  - [Running Evaluations](#running-evaluations)
- [Installation](#installation)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [Dataset Statistics](#dataset-statistics)
- [Citation](#citation)

## Overview

GSM8K-V addresses the limitations of text-only mathematical reasoning benchmarks by introducing visual representations that more closely mimic real-world mathematical problem solving. The benchmark consists of:

- **1,319 multimodal mathematical problems** derived from the GSM8K test split
- **5,343 structured visual scenes** (4.05 images per problem on average)
- **6 major categories** with 13 subcategories for fine-grained analysis
- **Human-verified annotations** ensuring data quality (91.15% human accuracy)
- **Comprehensive evaluation framework** supporting both closed-source (API-based) and open-source (vLLM-based) models

## Benchmark Construction

The construction pipeline transforms textual GSM8K problems into rich multimodal datasets through a systematic three-step process.

### Architecture

The construction pipeline consists of three main steps:

1. **Mathematical Information Decomposition and Allocation**: Parse textual problems into structured triples (object, math value, semantic) and allocate them across different scenes with controlled interference
2. **Scene Description Generation**: Generate structured scene descriptions using meta-description strategies for each mathematical information category
3. **Image Generation and Human Verification**: Generate multi-scene images and perform dual human cross-validation

```
GSM8K Text Problems
  → Step 1: Decomposition       (math info + scene allocation)
  → Step 2: Scene Generation    (structured scene descriptions: object, action, composition)
  → Step 3: Image Generation    (multi-scene images)
  → Human Verification          (quality control)
  → GSM8K-V Benchmark           (final dataset)
```

### Step 1: Mathematical Information Decomposition and Allocation

**Location**: `construct/step1.py`

This step systematically decomposes each GSM8K problem into structured mathematical representations and allocates them across scenes.

#### Mathematical Information Decomposition
Each problem is parsed into structured triples _(object, math value, semantic)_ using GPT-4.1:
- **object**: The entity described in the problem
- **math value**: Associated numerical attribute
- **semantic**: Contextual role of the information

#### Mathematical Information Classification
Mathematical information is categorized into 13 distinct classes based on semantic and representational requirements (detailed in the appendix of the paper).

#### Scene Allocation
Mathematical information is allocated across 2-11 scenes per problem following three principles:
- **Contextual grouping**: Related information grouped into same scene
- **Final isolation**: Problem question reserved for last scene
- **Atomic fidelity**: No inferred values, only extracted atomic facts
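
The triple structure and two of the allocation rules can be sketched in a few lines of Python. This is only an illustration, not the code in `construct/step1.py`: the class names `MathTriple` and `Scene`, the `is_question` flag, and the helper `check_allocation` are hypothetical, and the example values are made up. It checks the 2-11 scene range and the final-isolation rule.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MathTriple:
    """One atomic fact: (object, math value, semantic). Hypothetical structure."""
    obj: str          # entity described in the problem
    math_value: str   # numerical attribute, kept as text from the problem
    semantic: str     # contextual role of the information


@dataclass
class Scene:
    triples: List[MathTriple] = field(default_factory=list)
    is_question: bool = False  # True only for the scene holding the question


def check_allocation(scenes: List[Scene], min_scenes: int = 2, max_scenes: int = 11) -> List[str]:
    """Return a list of violated allocation rules (empty if none)."""
    problems = []
    if not (min_scenes <= len(scenes) <= max_scenes):
        problems.append(f"scene count {len(scenes)} outside {min_scenes}-{max_scenes}")
    # Final isolation: the question appears in the last scene only.
    question_positions = [i for i, s in enumerate(scenes) if s.is_question]
    if question_positions != [len(scenes) - 1]:
        problems.append("question must appear in the last scene only")
    return problems


scenes = [
    Scene([MathTriple("eggs laid", "16 per day", "daily production")]),
    Scene([MathTriple("eggs eaten", "3", "breakfast consumption")]),
    Scene(is_question=True),
]
print(check_allocation(scenes))  # []
```

Contextual grouping and atomic fidelity depend on the meaning of the text, so they are not checked by this sketch.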

#### Multi-dimensional Interference
Controlled interference is introduced to increase task difficulty:
- **Perception interference**: Visually salient irrelevant objects
- **Semantic interference**: Contextually close distractors

**Usage**:
```bash
python construct/step1.py --input_file data/input/stage1_input.json --output_file data/output/stage1/results.json
```

### Step 2: Scene Description Generation

**Location**: `construct/step2.py`

This step generates structured scene descriptions for each mathematical information category.

#### Meta Description Strategy Definition
High-level templates are constructed for each mathematical information type, providing systematic guidance for visual representation. Meta-description strategies include specialized templates for:
- Time & clock representations
- Percentage and ratio visualizations
- Measurement displays
- Signboard and icon elements

#### Scene Description Generation
Based on selected meta strategies, GPT-4.1 generates structured scene descriptions following a tripartite schema:
- **object**: Concrete entities that must appear and carry mathematical information
- **action**: State or activity defining object presentation and conveying semantic cues
- **composition**: Spatial arrangement and positional relations among elements

This structured approach ensures consistency across scenes and provides explicit guidance for image generation.
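
As a rough illustration of the tripartite schema, a scene description could be held in a small data class and flattened into text for an image prompt. The class name `SceneDescription`, the method `to_prompt`, and the output layout are assumptions made for this sketch; the format actually produced by `construct/step2.py` may differ.

```python
from dataclasses import dataclass


@dataclass
class SceneDescription:
    obj: str          # entities that carry the mathematical information
    action: str       # state or activity conveying semantic cues
    composition: str  # spatial arrangement of the elements

    def to_prompt(self) -> str:
        """Flatten the three fields into one text block (hypothetical layout)."""
        return (
            f"Object: {self.obj}\n"
            f"Action: {self.action}\n"
            f"Composition: {self.composition}"
        )


# Made-up example for a "time & clock" style scene.
desc = SceneDescription(
    obj="a wall clock showing 3:00",
    action="a student glances at the clock while packing a bag",
    composition="clock centered on the back wall, student in the foreground",
)
print(desc.to_prompt())
```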

**Usage**:
```bash
python construct/step2.py --input_file stage1_output.json --output_file stage2_output.json
```

### Step 3: Image Generation and Human Verification

**Location**: `construct/img_prompt_gen.py` (for prompt generation)

#### Multi-scene Image Generation
Using the structured scene descriptions from Step 2, images are generated with the GPT-Image-1 model. Each scene produces a 1024×1024 pixel square image following specific generation prompts.

#### Human Cross Check and Refinement
All generated images undergo dual human cross-validation by trained annotators, guided by three principles:
- **Consistency**: Visual scenes preserve entities, quantities, and constraints from the original text
- **Completeness**: All information necessary for solving is visually accessible
- **Compliance**: Images adhere to safety standards and formatting rules (no sensitive content, clear object identities, legible numerals)

Violations trigger refinement; severe cases require manual scene description correction.
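
One simple way to express the dual check is sketched below. It assumes that each annotator records a pass/fail verdict per principle and that an image is accepted only when both annotators pass all three; the function name `needs_refinement` and the dictionary layout are hypothetical, and the actual conflict-resolution rule may be more nuanced.

```python
from typing import Dict

PRINCIPLES = ("consistency", "completeness", "compliance")


def needs_refinement(review_a: Dict[str, bool], review_b: Dict[str, bool]) -> bool:
    """True if either annotator flags a violation of any principle (assumed rule)."""
    return not all(review_a[p] and review_b[p] for p in PRINCIPLES)


annotator_1 = {"consistency": True, "completeness": True, "compliance": True}
annotator_2 = {"consistency": True, "completeness": False, "compliance": True}

print(needs_refinement(annotator_1, annotator_1))  # False
print(needs_refinement(annotator_1, annotator_2))  # True
```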

#### Benchmark Construction
The iterative generation-verification loop continues until every problem-image pair meets the established requirements. The resulting GSM8K-V benchmark preserves semantic fidelity to GSM8K while enabling rigorous evaluation of multimodal models.

### Human Annotation

**Location**: `construct/human_check/human_annotation.py`

A comprehensive Streamlit-based annotation interface for quality control:

- **Multi-user annotation system** with conflict resolution
- **Real-time progress tracking** and statistics
- **Image preloading** for smooth navigation
- **Annotation history** and consensus checking
- **File locking** to prevent concurrent access issues

**Usage**:
```bash
streamlit run construct/human_check/human_annotation.py
```

## Benchmark Evaluation

The evaluation framework assesses multimodal models under three input modes (text-only, visual, and scene) and supports per-category analysis.

### Supported Models

We evaluate a comprehensive range of both closed-source and open-source VLMs:

#### Closed-Source Models (API-based)
- **Gemini-2.5-Pro**: Google's latest multimodal model
- **GPT-5**: OpenAI's advanced GPT series
- **GPT-4o**: OpenAI's multimodal flagship model
- **QVQ-Max-Latest**: Qwen's vision-language model

#### Open-Source Models (vLLM-based)
- **Llama-4 series**: Meta's latest Llama models (17B-16E-Instruct, 17B-128E-Instruct)
- **InternVL3.5 series**: Strong multimodal models (8B, 38B, 30B-A3B, 241B-A28B)
- **Qwen2.5-VL series**: Alibaba's vision-language models (7B, 32B, 72B-Instruct)
- **Ovis2.5 series**: Efficient multimodal models (2B, 9B)
- **Step3**: StepFun's multimodal model
- **Kimi-VL-A3B-Thinking-2506**: Moonshot AI's reasoning-enhanced model
- **MiniCPM-V 4.5**: Tsinghua's compact multimodal model
- **GLM-4.5V**: ZhipuAI's vision-language model

**Note**: Closed-source models are evaluated through their official APIs, while open-source models use the vLLM framework for efficient inference.

### Evaluation Modes

1. **Text-Only**: Traditional text-based mathematical reasoning
2. **Visual**: Models solve problems using generated images with implicit scene understanding
3. **Scene**: Direct scene description-based reasoning

### Data Categories

The benchmark supports fine-grained evaluation across six major categories:

- **Measurement**: Distance, length, area, volume
- **Physical Metrics**: Speed, weight, density
- **Ratio & Percentage**: Proportions, percentages, scaling
- **Signboard & Icon**: Text-based visual information
- **Temporal**: Time, dates, calendars, clocks
- **Other**: Miscellaneous mathematical concepts

### Running Evaluations

**Basic Usage**:
```bash
python eval/eval.py --config config.json --models gpt-4-vision
```

**Advanced Usage**:
```bash
# Evaluate specific models with category filtering
python eval/eval.py \
  --config config.json \
  --models gpt-4-vision claude-3-5-sonnet \
  --data-categories measurement temporal \
  --evaluation-type api \
  --num-samples 1000
```

**Configuration** (`config.json`):
```json
{
  "data_path": "data/metadata/meta.json",
  "image_dir": "/path/to/images",
  "results_dir": "results",
  "modes": ["visual", "text_only"],
  "prompt_modes": ["implicit", "explicit"],
  "seed": 42,
  "num_samples": null,
  "evaluation_type": "api",
  "api_models": {
    "gpt-4-vision": {
      "enabled": true,
      "concurrency": 3
    }
  }
}
```
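
A configuration like this can be inspected with the standard `json` module before launching a run. The snippet below is a minimal sketch that embeds a trimmed copy of the example so it runs on its own; reading `"num_samples": null` as "use every sample" is an assumption, not documented behavior of `eval/eval.py`.

```python
import json

CONFIG_TEXT = """
{
  "data_path": "data/metadata/meta.json",
  "results_dir": "results",
  "modes": ["visual", "text_only"],
  "prompt_modes": ["implicit", "explicit"],
  "seed": 42,
  "num_samples": null,
  "evaluation_type": "api",
  "api_models": {"gpt-4-vision": {"enabled": true, "concurrency": 3}}
}
"""

config = json.loads(CONFIG_TEXT)

# JSON null becomes Python None; here it is read as "use every sample" (assumption).
use_all_samples = config["num_samples"] is None

# Collect the models switched on for API evaluation.
enabled = [name for name, opts in config["api_models"].items() if opts.get("enabled")]

print(use_all_samples, enabled)  # True ['gpt-4-vision']
```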

## Installation

### Prerequisites

- Python 3.8+
- OpenAI API key (for construction)
- Model API keys (for evaluation)

### Setup


1. **Install dependencies**:
```bash
pip install -r requirements.txt
```

2. **Set environment variables**:
```bash
export OPENAI_API_KEY="your-openai-key"
# Add other required API keys
```
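
Before starting a long run, it can help to confirm that the required keys are actually set. The helper below is a small sketch; `missing_keys` is a hypothetical name, and only `OPENAI_API_KEY` is taken from this README.

```python
import os
from typing import Iterable, List, Mapping


def missing_keys(required: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Only OPENAI_API_KEY is named above; add the keys your chosen models need.
missing = missing_keys(["OPENAI_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```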

## Usage

### Benchmark Construction Pipeline

1. **Prepare input data** (GSM8K format)
2. **Run Step 1 (decomposition and allocation)**:
   ```bash
   python construct/step1.py --input_file input.json --output_file stage1_output.json
   ```
3. **Run Step 2 (scene description generation)**:
   ```bash
   python construct/step2.py --input_file stage1_output.json --output_file stage2_output.json
   ```
4. **Generate images** using your preferred image generation model
5. **Human annotation**:
   ```bash
   streamlit run construct/human_check/human_annotation.py
   ```

### Benchmark Evaluation

1. **Configure evaluation settings** in `eval/config.json`
2. **Run evaluation**:
   ```bash
   python eval/eval.py --config config.json
   ```
3. **Analyze results** in the `results/` directory
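
The layout of the files under `results/` is not described here, so the following is only a generic sketch of integer exact-match scoring, which fits the benchmark's integer answer type. The function names and the "take the last integer in the response" heuristic are assumptions; the scoring used by `eval/eval.py` may differ, and the heuristic does not handle decimal answers.

```python
import re
from typing import Optional, Sequence


def extract_last_integer(text: str) -> Optional[int]:
    """Return the last integer in `text`, ignoring thousands separators."""
    matches = re.findall(r"-?\d[\d,]*", text)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


def accuracy(predictions: Sequence[str], golds: Sequence[int]) -> float:
    """Fraction of responses whose last integer equals the gold answer."""
    correct = sum(extract_last_integer(p) == g for p, g in zip(predictions, golds))
    return correct / len(golds)


# Made-up responses and gold answers.
preds = ["The answer is 18.", "She pays $1,200 in total", "I am not sure"]
golds = [18, 1200, 7]
print(f"{accuracy(preds, golds):.2%}")  # 66.67%
```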

## Project Structure

```
gsm8k-v/
├── construct/                    # Benchmark construction code
│   ├── step1.py                 # Mathematical information extraction
│   ├── step2.py                 # Scene description generation
│   ├── img_prompt_gen.py        # Image prompt generation
│   ├── human_check/
│   │   └── human_annotation.py  # Human annotation interface
│   └── prompt/                  # Prompt templates
├── eval/                        # Benchmark evaluation code
│   ├── eval.py                  # Main evaluation script
│   ├── config/                  # Configuration files
│   │   ├── model_config.py      # Model configurations
│   │   ├── evaluation_config.py # Evaluation settings
│   │   └── async_model_factory.py # Model factory
│   ├── models/                  # Model implementations
│   ├── utils/                   # Utility functions
│   └── prompts/                 # Evaluation prompts
├── data/                        # Dataset files
├── example/                     # Example data
└── requirements.txt             # Python dependencies
```

## Dataset Statistics

Key statistics of the GSM8K-V benchmark:

| Statistic | Value |
|-----------|-------|
| Total samples | 1,319 |
| Total categories | 6 |
| Total sub-categories | 13 |
| Answer type | Integer |
| Total images | 5,343 |
| Average images per problem | 4.05 |
| Maximum images per problem | 11 |
| Minimum images per problem | 2 |
| Human accuracy on GSM8K-V | 91.15% |
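
The average in the table follows directly from the two totals, which can be checked with a quick calculation:

```python
total_images = 5343
total_problems = 1319

# Average number of images per problem, rounded to two decimals.
average_images = round(total_images / total_problems, 2)
print(average_images)  # 4.05
```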

