# Synthesizing-Geometry-Data


This repository contains the official implementation accompanying the **ICML 2026 submission**:

> **"Synthesizing Multimodal Geometry Datasets from Scratch and Enabling Visual Alignment via Plotting Code"**

This repository provides an end‑to‑end pipeline that starts from **symbolic geometry proofs**, synthesizes **natural language questions**, **plotting code**, and **high‑quality geometric figures**, and then performs **numeric correctness checks**. The same data can be further converted into **GRPO‑style training data** and used with **custom geometry rewards** for visual alignment of large multimodal models.

The codebase has been anonymized for submission (no institution‑specific paths or credentials).

---

## Repository Overview

```text
./
├── src/                      # Main library code
│   ├── core/                 # Core pipeline components
│   │   ├── LLMGenerator.py   # Geometry problem generation from symbolic input
│   │   ├── LLMJudge.py       # LLM-based quality control and filtering
│   │   ├── LLMPlotter.py     # Plotting code + geometry data generation
│   │   ├── Plotter.py        # Rendering engine for geometric figures
│   │   ├── NumericalCheck.py # Numeric/semantic answer verification
│   │   ├── VLImageQuality.py # Vision-language quality checking and captioning
│   │   └── pipeline.py       # Full end-to-end pipeline orchestration
│   ├── utils/                # Utility and configuration modules
│   │   ├── symbol_translator.py  # Symbolic geometry → natural language
│   │   ├── latex_parser.py       # LaTeX expression parsing helpers
│   │   ├── config.py             # Configuration management
│   │   ├── augment_config.py     # Data augmentation config helpers
│   │   ├── annotation_translator.py
│   │   ├── model_urls.py         # Model base URL registry
│   │   └── ...
│   └── scripts/              # Entry points for experiments
│       ├── run_pipeline.py   # Run the full geometry pipeline
│       ├── run_augment.py    # Run augmentation pipeline (if used)
│       └── extract_results.py # Convert raw outputs into compact JSON
├── data/                     # User-provided and generated data
│   ├── input/                # Input JSONL geometry problems
│   ├── output/               # Full pipeline outputs (JSONL)
│   └── figures/              # Rendered PNG figures
├── config/                   # Configuration files
│   ├── config.json           # Main runtime configuration
│   ├── augment_config.json   # Augmentation configuration
│   └── model_urls.json       # Mapping from model names to API endpoints
├── tools/                    # Reward + dataset conversion helpers
│   ├── geometry_reward.py    # Geometry-specific reward function (for RL)
│   ├── convert_to_grpo_parquet.py  # JSONL → Parquet for GRPO / VERL
│   └── split_train_val.py    # Utility to split JSONL into train/val
├── scripts/                  # Data preparation and debugging scripts
│   ├── extract_dataset.py    # Build geometry datasets from raw logs
│   ├── convert_to_llama_factory.py # Utilities to prepare data for LLaMAFactory (SFT)
│   └── debug.py              # Local debug examples for plotting / checking
├── requirements.txt          # Python dependencies
└── run.sh                    # Example shell entry for running the pipeline
```

---

## Quick Start

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

This will install (among others):
- **openai** – for calling LLM and VLM APIs.
- **numpy**, **sympy** – numeric and symbolic computations.
- **opencv-python**, **Pillow** – figure rendering and image I/O.
- **streamlit** (optional) – for interactive inspection/visualization.

### 2. Configure models and paths

Edit `config/config.json`, for example:

```json
{
  "model": "gpt-4o-mini",
  "figures_dir": "data/figures",
  "delay": 1.0,
  "max_retries": 3,
  "max_workers": 1
}
```

Set environment variables for credentials (recommended):

```bash
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://api.openai.com/v1"  # Or a provider-specific URL
```

You can also adjust `config/model_urls.json` and `src/utils/model_urls.py` to point to alternative providers or local gateways.
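
As a rough illustration of how these settings fit together, the sketch below merges a JSON config over fallback values and reads credentials from the environment. The `load_config` helper and the `DEFAULTS` table are hypothetical and are not the API of `src/utils/config.py`; the default values simply mirror the example above.

```python
import json
import os
from pathlib import Path

# Fallback values copied from the example config above (using them as
# defaults is an assumption, not documented behavior).
DEFAULTS = {
    "model": "gpt-4o-mini",
    "figures_dir": "data/figures",
    "delay": 1.0,
    "max_retries": 3,
    "max_workers": 1,
}


def load_config(path="config/config.json"):
    """Hypothetical helper: merge the JSON file (if present) over DEFAULTS."""
    config = dict(DEFAULTS)
    config_path = Path(path)
    if config_path.is_file():
        config.update(json.loads(config_path.read_text(encoding="utf-8")))
    # Credentials are kept out of the JSON file and read from the environment.
    config["api_key"] = os.environ.get("OPENAI_API_KEY")
    config["base_url"] = os.environ.get(
        "OPENAI_BASE_URL", "https://api.openai.com/v1"
    )
    return config
```

Keeping the API key in the environment rather than in `config/config.json` means the config file can be committed or shared without leaking credentials.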

### 3. Prepare input data

Place geometry problems (JSONL) into `data/input/`.  
Each line is a JSON object derived from symbolic geometry, e.g.:

```json
{
  "llm_input_renamed": "<problem> a : ; b : ; c d : coll a b c [000] coll a b d [001] ? coll b c d </problem>",
  "llm_output_renamed": "<proof> coll b c d [002] AR [000] [001] ; </proof>",
  "original_constructions": "..."
}
```
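
To make the record layout concrete, here is a minimal sketch that splits `llm_input_renamed` into its premises and its goal. It assumes that the `?` token separates the two parts, as in the example above; `parse_problem` is a hypothetical helper, not a function from this repository.

```python
import json
import re


def parse_problem(record):
    """Split the <problem> body into premises and goal (assumed '?' separator)."""
    text = record["llm_input_renamed"]
    match = re.search(r"<problem>(.*?)</problem>", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("missing <problem>...</problem> wrapper")
    body = match.group(1).strip()
    # Everything after the last '?' is treated as the statement to prove.
    premises, separator, goal = body.rpartition("?")
    if not separator:
        raise ValueError("no '?' goal separator found")
    return {"premises": premises.strip(), "goal": goal.strip()}


line = (
    '{"llm_input_renamed": "<problem> a : ; b : ; c d : coll a b c [000] '
    'coll a b d [001] ? coll b c d </problem>"}'
)
parsed = parse_problem(json.loads(line))
print(parsed["goal"])
```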

### 4. Run the main pipeline

From the repository root:

```bash
python src/scripts/run_pipeline.py \
  --input data/input/geometry.jsonl \
  --output data/output/results.jsonl \
  --start_index 0 \
  --end_index 10 \
  --config config/config.json
```

**Arguments:**
- **`--input`**: path to the input JSONL file.
- **`--output`**: path to the output JSONL file.
- **`--start_index` / `--end_index`**: index range of samples to process (useful for sharding and resuming).
- **`--config`**: path to configuration; defaults to `config/config.json` if omitted.
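
The index range can be pictured as a slice over the input lines. The sketch below assumes a half-open range `[start_index, end_index)`; whether `run_pipeline.py` treats `--end_index` as inclusive or exclusive is not stated here, so check the script before relying on this. `iter_samples` is a hypothetical helper.

```python
import json
from itertools import islice


def iter_samples(lines, start_index=0, end_index=None):
    """Yield (index, record) pairs for lines in [start_index, end_index)."""
    selected = islice(lines, start_index, end_index)
    for offset, line in enumerate(selected):
        line = line.strip()
        if line:  # skip blank lines but keep the original line position
            yield start_index + offset, json.loads(line)


demo_lines = ['{"id": 0}', '{"id": 1}', '{"id": 2}', '{"id": 3}']
for index, record in iter_samples(demo_lines, 1, 3):
    print(index, record)
```

Because `islice` works on any iterable, an open file handle can be passed in place of `demo_lines`, so a shard can be read without loading the whole file into memory.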

### 5. Extract compact results

To extract a concise summary from the verbose pipeline outputs:

```bash
python src/scripts/extract_results.py \
  data/output/results.jsonl \
  data/output/extracted_results.json
```

Each extracted entry includes:
- **`index`** – sample index.
- **`problem`** – generated natural language question.
- **`cot`** – chain‑of‑thought reasoning.
- **`answer`** – final answer (often in LaTeX).
- **`figure`** – path to the rendered figure.

---

## Output Format

The full pipeline output (JSONL) is structured as:

```json
{
  "index": 0,
  "status": "success",
  "problem_context": {
    "conditions": "...",
    "conclusion": "...",
    "constructions": "..."
  },
  "generation": {
    "status": "success",
    "question": "...",
    "cot": "...",
    "answer": "..."
  },
  "validation": {
    "status": "success",
    "passed": true,
    "reason": "...",
    "score": 85
  },
  "plotting": {
    "status": "success",
    "code": "...",
    "plotting_data": { "...": "..." },
    "figure_path": "data/figures/figure_0.png"
  },
  "numerical_check": {
    "status": "success"
  },
  "token_usage": 1234
}
```

The `plotting.plotting_data` field stores the intermediate geometry representation (points, segments, circles, annotations, and target quantity descriptors) that we use for both rendering and numeric checks. The `plotting.code` field holds the generated plotting code itself.
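
Since every stage carries its own `status`, a quick way to see where samples drop out is to count the first stage that did not succeed. The sketch below is a minimal example under the assumption that `"success"` is the only passing status value; `first_failed_stage` and `summarize` are hypothetical helpers.

```python
import json
from collections import Counter

# Stage keys as they appear in the output record above.
STAGES = ("generation", "validation", "plotting", "numerical_check")


def first_failed_stage(record):
    """Return the first stage whose status is not 'success', or None."""
    for stage in STAGES:
        if record.get(stage, {}).get("status") != "success":
            return stage
    return None


def summarize(jsonl_lines):
    """Count records per first failing stage; fully successful ones count as 'ok'."""
    counts = Counter()
    for line in jsonl_lines:
        if line.strip():
            record = json.loads(line)
            counts[first_failed_stage(record) or "ok"] += 1
    return counts


good = {stage: {"status": "success"} for stage in STAGES}
bad = dict(good, plotting={"status": "error"})
summary = summarize([json.dumps(good), json.dumps(bad), ""])
print(dict(summary))
```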

---

## Core Modules

### `src/core/` – Main pipeline components

- **`LLMGenerator.py`**  
  - **Input**: symbolic geometry problem data (conditions, conclusion, constructions).  
  - **Output**: natural language problem, chain‑of‑thought reasoning, and final answer (LaTeX).  
  - **Role**: converts proof‑style symbolic input into readable geometry questions using LLMs.

- **`LLMJudge.py`**  
  - **Input**: generated problem, reasoning, answer.  
  - **Output**: pass/fail, score, and textual justification.  
  - **Role**: filters out low‑quality questions to ensure dataset quality.

- **`LLMPlotter.py`**  
  - **Input**: natural language geometry problem.  
  - **Output**: plotting code (both executable Python and a structured representation):
    - Point coordinates.
    - Segments, circles, and other primitives.
    - Annotations (angles, lengths, right‑angles).
    - DSL expressions for target quantities.  
  - **Role**: synthesizes plotting code and geometry specifications that align with the problem statement.

- **`Plotter.py`**  
  - **Input**: plotting code (structured representation).  
  - **Output**: PNG geometry figures (saved in `data/figures/`).  
  - **Role**: draws high‑quality figures using OpenCV and PIL, including labels, marks, and annotations.

- **`NumericalCheck.py`**  
  - **Input**: predicted answer string and plotting metadata (including quantity DSL).  
  - **Output**: a correctness flag and auxiliary info.  
  - **Role**: evaluates whether the answer is numerically consistent with the constructed geometry.

- **`VLImageQuality.py`**  
  - **Input**: rendered image and plotting code (structured representation).  
  - **Output**: quality scores and captions from a vision‑language model.  
  - **Role**: optional automatic checking of figure clarity and semantic alignment.

- **`pipeline.py`**  
  - **Role**: orchestrates the entire pipeline:
    - Symbolic → natural language question generation.
    - LLM‑based judging.
    - Plotting code generation.
    - Rendering and numeric verification.  
  - **Features**: multi‑threaded execution, resumability via `start_index` / `end_index`, robust error handling.
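
The idea behind the numeric check can be illustrated with a small, self-contained sketch: measure the target quantity from the plotted coordinates and compare it with the predicted answer within a tolerance. The point dictionary layout, the helper names, and the tolerance values are all assumptions for illustration and do not reflect the actual interface of `NumericalCheck.py`.

```python
import math


def segment_length(points, a, b):
    """Euclidean distance between two named points."""
    (x1, y1), (x2, y2) = points[a], points[b]
    return math.hypot(x2 - x1, y2 - y1)


def is_consistent(predicted, measured, rel_tol=1e-3, abs_tol=1e-6):
    """True when the predicted value matches the measured one within tolerance."""
    return math.isclose(predicted, measured, rel_tol=rel_tol, abs_tol=abs_tol)


# Hypothetical plotting data: a 3-4-5 right triangle.
points = {"A": (0.0, 0.0), "B": (3.0, 0.0), "C": (3.0, 4.0)}
measured_ac = segment_length(points, "A", "C")
print(is_consistent(5.0, measured_ac))
```

A relative tolerance is used because coordinates produced by generated plotting code are floating-point approximations, so exact equality would reject many correct answers.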

### `src/utils/` – Utilities and configuration

- **`symbol_translator.py`**: translates the symbolic geometry DSL (e.g., `coll a b c`) into natural language statements for conditions and goals.
- **`latex_parser.py`**: parses LaTeX expressions into SymPy expressions / floats, enabling robust numeric comparison of answers.
- **`config.py`**: loads and exposes configuration values to the rest of the pipeline.
- **`augment_config.py`**, **`annotation_translator.py`**: helper modules for data augmentation and annotation handling.
- **`model_urls.py`**: registry mapping logical model names to actual API base URLs.
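
As a toy illustration of the symbolic-to-text step, the sketch below renders a `coll` statement as an English sentence. Only the `coll` predicate appears in this README, so it is the only one handled; the function name, the template wording, and the uppercase point labels are assumptions rather than the behavior of `symbol_translator.py`.

```python
def join_points(points):
    """Format point labels as 'A, B and C'."""
    return ", ".join(points[:-1]) + " and " + points[-1]


# Hypothetical predicate table; the real translator likely covers many more.
TEMPLATES = {
    "coll": lambda pts: "Points " + join_points(pts) + " are collinear.",
}


def translate(statement):
    """Translate one symbolic statement such as 'coll a b c' into a sentence."""
    predicate, *args = statement.split()
    if predicate not in TEMPLATES:
        raise KeyError("unsupported predicate: " + predicate)
    points = [name.upper() for name in args]
    return TEMPLATES[predicate](points)


print(translate("coll a b c"))
```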

---

## Data Directories

- **`data/input/`**  
  - JSONL input problems (one JSON object per line).

- **`data/output/`**  
  - Raw pipeline outputs (JSONL), including intermediate states and logs.

- **`data/figures/`**  
  - PNG figures rendered by `Plotter.py` (e.g., `figure_{index}.png`).

---

## Training Setup (SFT and RL)

- **Supervised fine-tuning (SFT)**: we prepare the synthesized geometry data with  
  `scripts/convert_to_llama_factory.py` and train models using **LLaMAFactory**.
- **Reinforcement learning (RL)**: we convert JSONL outputs to Parquet with  
  `tools/convert_to_grpo_parquet.py` and train with **VERL** (GRPO), using  
  `tools/geometry_reward.py` as a **geometry-specific reward** that combines
  formatting and answer correctness.
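
A reward that combines formatting and answer correctness could look roughly like the sketch below. It is not the implementation in `tools/geometry_reward.py`: the `\boxed{...}` answer convention, the exact-string comparison, and the 0.1 formatting weight are all assumptions chosen to keep the example short.

```python
import re

# Assumed answer convention: the final answer is wrapped in \boxed{...}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def normalize(answer):
    """Drop whitespace and surrounding '$' so trivially different strings match."""
    return re.sub(r"\s+", "", answer).strip("$")


def geometry_reward(response, reference, format_weight=0.1):
    """Weighted sum of a formatting score and an exact-match correctness score."""
    match = BOXED.search(response)
    if match is None:
        return 0.0  # no parseable answer: neither format nor correctness credit
    correct = 1.0 if normalize(match.group(1)) == normalize(reference) else 0.0
    return format_weight * 1.0 + (1.0 - format_weight) * correct


print(geometry_reward(r"The length is \boxed{5}", "5"))
print(geometry_reward(r"The length is \boxed{4}", "5"))
print(geometry_reward("The length is 5", "5"))
```

Giving a small amount of credit for format alone lets the policy learn the output convention early, while the larger correctness term keeps the reward dominated by getting the answer right.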

---

## Debug and Example Scripts

- **`scripts/debug.py`**  
  - Contains local examples for:
    - Testing `LLMPlotter` and `Plotter` on hand‑crafted problems.
    - Evaluating `NumericalCheck` on complex LaTeX answers.
    - Probing `VLImageQuality` and `VisualizeQA` behaviors.

These examples are not required for standard usage but are useful for sanity checks and for extending the system.

---

## Relation to the ICML 2026 Paper

This repository implements the pipeline and training utilities described in  
**"Synthesizing Multimodal Geometry Datasets from Scratch and Enabling Visual Alignment via Plotting Code"** (ICML 2026 submission).  
It covers:
- **Symbolic‑to‑multimodal dataset synthesis** (text + plotting code + figures).
- **Automatic quality control** (LLM judging, numeric checks, and optional VLM image QA).
- **Reward design and GRPO training** for visual alignment on geometry problems.

BibTeX and final citation details will be added after the review process.

