# InteractScience Benchmark

InteractScience is a benchmark specifically designed to evaluate the capability of large language models in generating interactive scientific demonstration code. This project provides a complete evaluation pipeline including model inference, automated testing, and multi-dimensional assessment.

## 📁 Directory Structure

```
.
├── data/                           # Benchmark dataset
│   ├── interactscience.jsonl       # Main dataset file containing problems and references
│   └── snapshots/                  # Reference screenshot directory
│       ├── *_Snapshot-1.png
│       ├── *_Snapshot-2.png
│       └── ...
├── PFT_tests/                      # Program Functionality Testing (PFT) scripts
│   ├── *.spec.js                   # Playwright test scripts
│   └── ...
├── VQT_tests/                      # Visual Quality Testing (VQT) scripts
│   ├── *.spec.js                   # Playwright test scripts
│   └── ...
├── eval/                           # Model inference results
│   ├── interactscience_lm_*.jsonl  # Language model inference results
│   ├── interactscience_vlm_*.jsonl # Vision-language model inference results
│   └── ...
├── results/                        # Test result data
│   ├── lm_results/                 # Language model test results
│   │   ├── PFT_test_results/       # Program functionality test results
│   │   ├── VQT_test_results/       # Visual quality test results
│   │   ├── VQT_clip_results/       # CLIP scoring results
│   │   └── VQT_vlm_judge_results/  # VLM scoring results
│   └── vlm_results/                # Vision-language model test results
├── run_generation.sh               # Model inference script
├── run_benchmark.sh                # Automated testing script
├── run_vlm_as_judge.sh             # VLM scoring script
├── cal_metrics.py                  # Metrics calculation script
├── test_llm.py                     # Language model testing main program
├── vlm_as_judge.py                 # VLM scoring main program
├── clip_score.py                   # CLIP score calculation
└── extract_and_save_code.py        # Code extraction and saving
```

## 🚀 Usage Tutorial

### 1. Environment Setup

First install Node.js and npm, then install the Playwright testing environment:

```bash
# Install project dependencies
npm install

# Install Playwright browsers
npx playwright install
```

### 2. Model Inference

Use the `run_generation.sh` script for model inference:

```bash
# Edit the model path and parameters in the script
vim run_generation.sh

# Run inference (requires model path configuration)
bash run_generation.sh
```

**Script Description:**
- Starts vLLM API server
- Calls `test_llm.py` for inference
- Results saved to `eval/` directory

### 3. Automated Testing

Use the `run_benchmark.sh` script for automated testing:

```bash
# Set the model name to test
export MODEL="your_model_name"

# Run tests
bash run_benchmark.sh
```

**Testing Process:**
1. Extract HTML code from inference results (`extract_and_save_code.py`)
2. Execute Program Functionality Testing (PFT) using `playwright_PFT.config.js`
3. Execute Visual Quality Testing (VQT) using `playwright_VQT.config.js`
4. Calculate CLIP similarity scores (`clip_score.py`)
5. Results saved to `results/` directory

### 4. VLM Scoring

Use `run_vlm_as_judge.sh` for VLM-as-Judge evaluation:

```bash
# Edit model and path configuration in the script
vim run_vlm_as_judge.sh

# Run VLM scoring
bash run_vlm_as_judge.sh
```

**Scoring Description:**
- Uses vision-language models to score generated results
- Compares reference screenshots with generated screenshots
- Evaluation based on predefined checklists

### 5. Results Analysis

Use `cal_metrics.py` and `cal_vlm_as_judege_score.py` to calculate final metrics:

```bash
python cal_metrics.py
python cal_vlm_as_judege_score.py
```

## 📊 Dataset Description

### interactscience.jsonl
Main dataset file, each line contains a test sample:
- `id`: Unique identifier
- `question`: Detailed HTML implementation plan
- `lm_system_prompt`: Language model system prompt
- `vlm_system_prompt`: Vision-language model system prompt
- `image_path`: List of reference screenshot paths
- `snapshot_checklists`: Visual verification checklists

### Reference Screenshots
Located in `data/snapshots/` directory, naming format:
- `{task_id}_Snapshot-{number}.png`

## 🧪 Test Types

### 1. Program Functionality Testing (PFT)
- Validates functional correctness of HTML code
- Checks interactive element behavior
- Tests JavaScript logic

### 2. Visual Quality Testing (VQT)
- Generates page screenshots
- Compares with reference screenshots
- Calculates perceptual similarity (CLIP scores)
- Calculates semantic correctness (VLM-judge scores)

## 🛠️ Core Scripts Description

### test_llm.py
Language model testing main program:
```bash
python test_llm.py \
    --dataset_path data/interactscience.jsonl \
    --prompt_type lm_system_prompt \
    --dump_path eval/result.jsonl \
    --model_path your_model_path \
    --base_url http://localhost:8000/v1 \
    --api_key EMPTY
```

### vlm_as_judge.py
VLM scoring main program:
```bash
python vlm_as_judge.py \
    --reference_image_dir data/snapshots \
    --generated_image_dir generated_images \
    --checklist_file data/checklists.jsonl \
    --output_path results/vlm_judge.jsonl \
    --base_url your_api_endpoint \
    --api_key your_api_key
```

### clip_score.py
CLIP score calculation:
```bash
python clip_score.py \
    --jsonl_path data/question_lm.jsonl \
    --reference_dir data/snapshots \
    --prediction_dir snapshots
```

## 📈 Evaluation Metrics

- **Program Functionality Test Pass Rate**: Percentage of PFT test cases passed
- **Visual Quality Score**: Visual similarity based on CLIP model
- **VLM Score**: Comprehensive score given by multimodal models
- **Task Completion Rate**: Completion status based on checklists

## 🔧 Configuration Files

### playwright_PFT.config.js
Configures program functionality testing:
- Test timeout settings
- Browser configuration
- Report format

### playwright_VQT.config.js
Configures visual quality testing:
- Screenshot parameters
- Viewport size
- Wait strategies