# Schema2Opt: Synthetic Dataset Generation for Prescriptive Analytics Agent & Data2Decision: A Prescriptive Analytics System Bridging Enterprise Information and Optimal Decisions

This repository contains the implementation and evaluation framework for generating synthetic optimization problems from SQL database schemas, together with baseline evaluations for text-to-optimization tasks.

## Overview

Schema2Opt transforms SQL database schemas from the Spider dataset into realistic linear and mixed-integer optimization problems using an alternating optimization approach. For each schema, the system generates a natural language problem description, a mathematical formulation, and solver implementations in Gurobipy, DOCplex, and Pyomo.

## Repository Structure

```
schema2opt/
├── baselines/                           # Baseline evaluation framework and results
│   ├── comprehensive_baseline_evaluation.py      # Main evaluation framework
│   ├── simple_zero_shot/                         # Simple baseline implementation
│   ├── or_llm_agent/                             # OR-LLM-Agent implementation
│   ├── Chain-of-Experts/                         # Chain-of-Experts implementation
│   ├── OptiMUS/                                  # OptiMUS implementation
│   └── [various evaluation configurations]       # Additional evaluation setups
├── schema2optsgd/                       # Core synthetic data generation
│   ├── text2opt_dataset_alternating_optimization/ # Generated synthetic dataset
│   ├── utils.py                                  # Utility functions
│   └── verification.py                           # Data verification tools
└── spider/                              # Original Spider dataset schemas
```

## Core Components

### 1. Synthetic Data Generation (`schema2optsgd/`)

**Generated Dataset**: `schema2optsgd/text2opt_dataset_alternating_optimization/`

The synthetic data generation pipeline implements an **Alternating Optimization Algorithm** built from three expert roles and a validation step:
- **OR Expert**: Designs linear optimization formulations
- **Data Engineer**: Implements database schema modifications
- **Triple Expert**: Generates realistic business data
- **Cross-Solver Validation**: Checks each problem against Gurobipy, DOCplex, and Pyomo implementations

Each generated problem includes:
- Natural language problem description
- Mathematical formulation
- Multiple solver implementations (Gurobipy, DOCplex, Pyomo)
- Synthetic business data
- Solution verification results
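
The cross-solver validation idea can be illustrated with a minimal sketch. It assumes each solver implementation reports a single objective value (or `None` on failure) and treats a problem as verified only when all reported values agree within a tolerance. The function name `solvers_agree`, the tolerances, and the sample values are hypothetical and are not taken from `verification.py`.

```python
import math


def solvers_agree(objectives, rel_tol=1e-6, abs_tol=1e-6):
    """Return True when every solver reported an objective value
    and all values match the first one within the given tolerances.

    `objectives` maps a solver name to its objective value, or to None
    when that solver failed. This layout is an assumption for illustration.
    """
    values = list(objectives.values())
    if not values or any(v is None for v in values):
        return False
    reference = values[0]
    return all(
        math.isclose(v, reference, rel_tol=rel_tol, abs_tol=abs_tol)
        for v in values[1:]
    )


# Hypothetical objective values, for illustration only.
results = {"gurobipy": 1250.0, "docplex": 1250.0000004, "pyomo": 1250.0}
print(solvers_agree(results))
```

Comparing with a tolerance rather than strict equality matters because different solvers can return objective values that differ in the last few floating-point digits.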

### 2. Spider Dataset (`spider/`)

Original schemas from the Spider text-to-SQL dataset, used as the foundation for optimization problem generation. They provide realistic database structures from various domains (e.g., university, company, entertainment), which are transformed into optimization contexts.

### 3. Baseline Evaluations (`baselines/`)

Comprehensive evaluation framework implementing a **two-stage evaluation approach**:
1. **Stage 1**: Generate optimization problems and retrieve relevant data from synthetic databases
2. **Stage 2**: Provide the natural language optimization problem and the retrieved data to each baseline model, which generates solver code, then compare the resulting solutions
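
The two stages can be sketched roughly as follows. This is an illustration under stated assumptions, not the logic of `comprehensive_baseline_evaluation.py`: it uses an in-memory SQLite database from the standard library, and the names `retrieve_data`, `evaluate_problem`, `stub_baseline`, the `products` table, and the relative tolerance are all hypothetical. In particular, `stub_baseline` stands in for a model that would generate and run solver code.

```python
import sqlite3


def retrieve_data(conn, query):
    """Stage 1: run a retrieval query and return rows as dictionaries."""
    cur = conn.execute(query)
    columns = [d[0] for d in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


def evaluate_problem(conn, problem_text, query, baseline, expected, tol=1e-6):
    """Run both stages and compare the baseline's objective to `expected`."""
    data = retrieve_data(conn, query)          # Stage 1: data retrieval
    predicted = baseline(problem_text, data)   # Stage 2: baseline produces an objective
    if predicted is None:
        return False
    return abs(predicted - expected) <= tol * max(1.0, abs(expected))


# Hypothetical toy database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, profit REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)", [("a", 3.0), ("b", 5.0)])


def stub_baseline(problem_text, rows):
    """Stand-in for a model: picks the single most profitable product."""
    return max(row["profit"] for row in rows)


ok = evaluate_problem(
    conn,
    "Pick the single most profitable product.",
    "SELECT name, profit FROM products",
    stub_baseline,
    expected=5.0,
)
print(ok)
```

Keeping retrieval separate from model invocation means every baseline receives the same retrieved data, so differences in outcome come from the modeling step alone.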

#### Implemented Baselines

1. **Data2Decision (Our Method)** (`majority_2_stage_mul_solver/`)
   - **Paper**: "Data2Decision: A Prescriptive Analytics System Bridging Enterprise Information and Optimal Decisions"
   - **Method**: Database-grounded prescriptive analytics with two-stage pipeline and test-time scaling
   - **Features**: Multi-solver consensus, majority voting, incremental temperature scaling

2. **Simple Zero-Shot** (`simple_zero_shot/`)
   - Direct model prompting for optimization code generation

3. **OR-LLM-Agent** (`or_llm_agent/`)
   - **Paper**: "Automating Modeling and Solving of Operations Research Optimization Problem with Reasoning Large Language Model"
   - **Code**: https://github.com/bwz96sco/or_llm_agent
   - **ArXiv**: https://arxiv.org/abs/2503.10009

4. **Chain-of-Experts** (`Chain-of-Experts/`)
   - **Paper**: "When LLMs Meet Complex Operations Research Problems"
   - **OpenReview**: https://openreview.net/forum?id=HobyL1B9CZ
   - **Code**: https://github.com/xzymustbexzy/Chain-of-Experts

5. **OptiMUS-0.3** (`OptiMUS/`)
   - **Paper**: "Using Large Language Models to Model and Solve Optimization Problems at Scale"
   - **Code**: https://github.com/teshnizi/OptiMUS/tree/optimus-v0.3
   - **ArXiv**: https://arxiv.org/abs/2407.19633
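
The Data2Decision features listed above (multi-solver consensus, majority voting, incremental temperature scaling) can be pictured with a small sketch. This is one possible reading, not the implementation in `majority_2_stage_mul_solver/`: it assumes a hypothetical `generate(temperature)` callable that returns one objective value per solver (or `None` on failure), pools candidates across rounds, and stops at the first temperature where some value collects at least `min_votes` votes. The temperature schedule, rounding precision, and vote threshold are illustrative assumptions.

```python
from collections import Counter


def majority_vote(objectives, ndigits=4):
    """Return (value, votes) for the most common rounded objective.

    Failed runs (None) are ignored; (None, 0) is returned if nothing succeeded.
    """
    rounded = [round(v, ndigits) for v in objectives if v is not None]
    if not rounded:
        return None, 0
    value, votes = Counter(rounded).most_common(1)[0]
    return value, votes


def solve_with_scaling(generate, temperatures=(0.0, 0.3, 0.6), min_votes=2):
    """Raise the sampling temperature step by step until a consensus appears.

    `generate` is a hypothetical callable: temperature -> list of objectives.
    """
    candidates = []
    for temperature in temperatures:
        candidates.extend(generate(temperature))
        value, votes = majority_vote(candidates)
        if votes >= min_votes:
            return value
    return None  # no consensus within the temperature schedule


# Hypothetical per-temperature solver outputs, for illustration only.
fake_runs = {
    0.0: [10.0, None, 12.0],      # solvers disagree, no consensus yet
    0.3: [10.0, 10.00001, None],  # two more votes for roughly 10.0
}
answer = solve_with_scaling(lambda t: fake_runs.get(t, []))
print(answer)
```

Starting at a low temperature and increasing it only when no consensus is reached spends extra sampling effort on the problems that need it.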


### 4. Evaluation Framework

**Main Evaluation Script**: `baselines/comprehensive_baseline_evaluation.py`


## Acknowledgments

This project uses the Spider dataset:
- Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task
- Authors: Tao Yu, Rui Zhang, Kai Yang, et al.
- License: CC BY-SA 4.0
- Website: https://yale-lily.github.io/spider

