# SIMUPHY: TOWARDS PHYSICAL UNDERSTANDING, REASONING, AND EVALUATION VIA CODE GENERATION


This repository provides the full pipeline for constructing and validating the **SimuPhy** dataset.  
The process consists of six main steps:

---

### Step 1: Data Construction
**File:** `1_dataConstruction.py`  
- Generate **basic scenarios** across physics domains.  
- Extend them with **conditions** (e.g., collisions, external forces).  
- Formulate **dynamic scenarios**.  
- Check whether each scenario is **physically plausible**.  
- Generate initial **verification questions (VLQs)**.
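
As a rough illustration of the construction steps above, the sketch below shows one way a scenario record and a plausibility check could look. The `Scenario` fields, the `extend_with_conditions` helper, and the specific plausibility rules are assumptions made for illustration; they are not taken from `1_dataConstruction.py`.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass
class Scenario:
    """Hypothetical scenario record; field names are illustrative only."""
    domain: str                      # e.g. "mechanics"
    description: str                 # natural-language scenario text
    conditions: List[str] = field(default_factory=list)
    mass_kg: float = 1.0             # assumed example parameter
    restitution: float = 1.0         # assumed example parameter (0 = inelastic, 1 = elastic)
    questions: List[str] = field(default_factory=list)


def extend_with_conditions(base: Scenario, conditions: List[str]) -> Scenario:
    """Return a copy of a basic scenario with extra conditions attached."""
    return replace(base, conditions=base.conditions + list(conditions))


def is_physically_plausible(s: Scenario) -> bool:
    """Toy plausibility rules: positive mass, restitution within [0, 1]."""
    return s.mass_kg > 0 and 0.0 <= s.restitution <= 1.0


basic = Scenario(domain="mechanics", description="A ball is dropped onto a floor.")
dynamic = extend_with_conditions(basic, ["collision with the floor"])
```

In the actual pipeline the plausibility check is presumably more involved; this only shows the shape of the data flow from basic scenario to extended scenario to filter.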

---

### Step 2: Verification Question Review
**File:** `2_verifyVLQ.py`  
- Perform a **second-layer review** of the generated verification questions.  
- Filter or regenerate insufficient or ambiguous questions.  
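
The filter-or-regenerate logic above can be sketched as a small routing function. The verdict labels (`keep`, `regenerate`, `drop`) and the `toy_reviewer` heuristic are hypothetical stand-ins; in `2_verifyVLQ.py` the review is presumably performed by a model rather than a rule.

```python
from typing import Callable, Dict, List


def review_questions(
    questions: List[str], reviewer: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Route each question into a bucket according to the reviewer's verdict.

    Unknown verdicts fall back to "regenerate" so nothing is silently lost.
    """
    result: Dict[str, List[str]] = {"keep": [], "regenerate": [], "drop": []}
    for q in questions:
        verdict = reviewer(q)
        result.get(verdict, result["regenerate"]).append(q)
    return result


def toy_reviewer(question: str) -> str:
    """Hypothetical rule-based reviewer used only to exercise the routing."""
    if not question.strip().endswith("?"):
        return "drop"          # not phrased as a question
    if len(question.split()) < 4:
        return "regenerate"    # too short to be unambiguous
    return "keep"


buckets = review_questions(
    ["Does the ball bounce after hitting the floor?", "Is it moving?", "The ball"],
    toy_reviewer,
)
```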

---

### Step 3: Code Generation
**File:** `3_get_code_from_llm.py`  
- Query LLMs to produce **reasoning traces (CoT)**.  
- Generate **Python simulation code** for each scenario.  
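
An LLM response that contains both a reasoning trace and simulation code usually has to be split before the code can be run. The sketch below assumes the model wraps its code in a Markdown code fence and that the reasoning precedes it; the function name `split_response` and this response layout are assumptions, not the behaviour of `3_get_code_from_llm.py`.

```python
import re
from typing import Optional, Tuple

# Built programmatically so the example can itself live inside a fenced block.
FENCE = "`" * 3
CODE_RE = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def split_response(response: str) -> Tuple[str, Optional[str]]:
    """Split a response into (reasoning, code).

    Reasoning is the text before the first fenced block; code is the content
    of that block, or None if no fenced block is found.
    """
    match = CODE_RE.search(response)
    if match is None:
        return response.strip(), None
    return response[: match.start()].strip(), match.group(1).strip()


example = (
    "The ball falls under gravity.\n"
    + FENCE + "python\nprint('hi')\n" + FENCE + "\n"
)
reasoning, code = split_response(example)
```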

---

### Step 4: Video Rendering
**File:** `4_genVideo.py`  
- Execute the generated code.  
- Render the corresponding **simulation videos**.  
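
Generated code is untrusted and may hang or crash, so it is commonly executed in a separate process with a timeout. The following is a minimal sketch under that assumption; `run_generated_code`, the script name `scenario.py`, and the 60-second default are illustrative and not taken from `4_genVideo.py`.

```python
import subprocess
import sys
from pathlib import Path
from typing import Tuple


def run_generated_code(
    code: str, out_dir: Path, timeout_s: float = 60.0
) -> Tuple[bool, str]:
    """Write the code to out_dir and run it there in a child process.

    Returns (succeeded, log), where log is the child's stderr or "timeout".
    Any files the script writes (e.g. a rendered video) stay in out_dir.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    script = out_dir / "scenario.py"  # hypothetical file name
    script.write_text(code, encoding="utf-8")
    try:
        proc = subprocess.run(
            [sys.executable, script.name],
            cwd=out_dir,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    return proc.returncode == 0, proc.stderr
```

A caller would typically inspect `out_dir` afterwards for the rendered video and record failures so the scenario can be retried or skipped.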

---

### Step 5: VLM Judgment
**File:** `5_VLMJudge.py`  
- Feed the videos, scenario descriptions, and VLQs into a **Vision–Language Model (VLM)**.  
- Obtain judgments (**True / False / Not Sure**) with confidence scores.  
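
To make the three-way judgment with a confidence score concrete, the sketch below parses a VLM reply into a normalized `(label, confidence)` pair. It assumes the model is asked to reply with a JSON object containing `judgment` and `confidence` keys; that reply format is an assumption for illustration and may differ from what `5_VLMJudge.py` expects.

```python
import json
from typing import Tuple

# Canonical labels, keyed by their lowercase form.
LABELS = {"true": "True", "false": "False", "not sure": "Not Sure"}


def parse_judgment(raw: str) -> Tuple[str, float]:
    """Parse a reply into (label, confidence in [0, 1]).

    Anything unparseable is treated as ("Not Sure", 0.0) rather than raising.
    """
    try:
        data = json.loads(raw)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        label = LABELS.get(str(data.get("judgment", "")).strip().lower(), "Not Sure")
        confidence = float(data.get("confidence", 0.0))
    except (ValueError, TypeError):
        return "Not Sure", 0.0
    return label, min(max(confidence, 0.0), 1.0)
```

Falling back to `Not Sure` keeps malformed replies from being counted as either a pass or a failure.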

---

### Step 6: Validation
**File:** `6_validation.py`  
- Collect and summarize VLM judgments.  
- Produce the **final evaluation results** of text–code–video consistency.  
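
One plausible way to summarize the judgments is to count labels and then mark a scenario as consistent only if all of its questions were judged `True` with sufficient confidence. The record keys (`scenario_id`, `label`, `confidence`), the `min_conf` threshold, and the all-questions-must-pass rule are assumptions for this sketch, not the criteria used in `6_validation.py`.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, List


def summarize(judgments: List[Dict[str, Any]], min_conf: float = 0.5) -> Dict[str, Any]:
    """Aggregate per-question judgments into per-scenario results."""
    counts = Counter(j["label"] for j in judgments)
    per_scenario = defaultdict(list)
    for j in judgments:
        passed = j["label"] == "True" and j["confidence"] >= min_conf
        per_scenario[j["scenario_id"]].append(passed)
    consistent = sorted(sid for sid, oks in per_scenario.items() if all(oks))
    rate = len(consistent) / len(per_scenario) if per_scenario else 0.0
    return {"counts": dict(counts), "consistent": consistent, "pass_rate": rate}


summary = summarize([
    {"scenario_id": "s1", "label": "True", "confidence": 0.9},
    {"scenario_id": "s1", "label": "True", "confidence": 0.8},
    {"scenario_id": "s2", "label": "False", "confidence": 0.7},
])
```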

---

## Notes
- `code_utils.py` and `llm_utils.py` are **utility libraries** used across the above steps.  
- `llm_prompts.py` contains **prompt templates** for comparing **VLM-as-Judge** and **LLM-as-Judge** approaches.  
- `tokens.py` is used to **analyze token lengths** in model responses.  
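
For a sense of what a token-length analysis might report, the sketch below computes basic statistics over a list of responses. The tokenizer is passed in as a callable; the whitespace split used as the default is only a crude stand-in, and both it and the function name `token_length_stats` are assumptions rather than what `tokens.py` does.

```python
import statistics
from typing import Callable, Dict, List


def token_length_stats(
    responses: List[str],
    tokenize: Callable[[str], List[str]] = str.split,  # stand-in tokenizer
) -> Dict[str, float]:
    """Return mean, median, and max token counts for a list of responses."""
    lengths = [len(tokenize(r)) for r in responses]
    if not lengths:
        return {"mean": 0.0, "median": 0.0, "max": 0.0}
    return {
        "mean": float(statistics.mean(lengths)),
        "median": float(statistics.median(lengths)),
        "max": float(max(lengths)),
    }


stats = token_length_stats(["a b", "a b c d", "a b c d e f"])
```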




