ScenePhys — Controllable Physics Videos for World-Model Evaluation

Published: 19 Sept 2025 · Last Modified: 19 Sept 2025 · NeurIPS 2025 Workshop EWM · CC BY 4.0
Keywords: physics video QA, PhET, vision–language models, video understanding, robustness, educational simulations, benchmarking, world models
TL;DR: PhET-Physics-VideoQA is a controlled video benchmark from educational simulations. Each clip has conceptual, numerical, and error-detection questions with metadata. A reproducible protocol reveals VLM weaknesses on traps and quantum cases.
Abstract: We present PhET-Physics-VideoQA, a controlled benchmark for assessing physics understanding in vision–language models (VLMs) from video. The corpus comprises 382 short clips sourced from PhET Interactive Simulations, covering 17 topics across four fields (Mechanics & Fluids, Optics, Electromagnetism & Circuits, and Quantum Mechanics). Each clip is paired with a triad of expert-validated questions (conceptual, numerical, and error-detection), yielding 1,146 Q/A items. The design emphasizes pixel-grounded reasoning: many clips display gauges and sliders, so models must recover numeric values from frames rather than rely on language priors. Evaluation is reproducible and type-specific. Numerical items are graded deterministically against gold values with absolute/relative tolerances and unit checks. Conceptual and error-detection items are judged by a rubric-based LLM that returns strict JSON, supports dual-judge scoring, and is run at zero temperature with cached transcripts. We report results for three video-capable VLMs (GPT-4o-mini, Gemini-2.5-Flash-Lite, Qwen-VL-Plus). Across domains, error-detection (“trap”) questions are consistently the most difficult, typically scoring 0.5–1.3 points lower than conceptual or numerical items on a 1–5 scale. Higher-concept physics, particularly quantum content, remains challenging for all models. PhET-Physics-VideoQA thus offers a rigorous, transparent, and cost-efficient testbed for measuring genuine physics competence in video settings and a practical resource for advancing research on multimodal world models.
Project: https://scenephys.github.io/ · Dataset: https://huggingface.co/datasets/ScenePhys/ScenePhys · Code: https://github.com/ScenePhys/codebase
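To make the deterministic numerical grading concrete, here is a minimal sketch of the tolerance-plus-unit-check logic the abstract describes. The function name `grade_numeric`, the unit table, and the tolerance values are illustrative assumptions, not the benchmark's released code.

```python
# Hypothetical sketch of deterministic numerical grading: a prediction passes
# if it matches the gold value within an absolute or relative tolerance and
# its unit is dimensionally compatible. Names and tolerances are illustrative.
import math

# Assumed unit-conversion table: unit -> (canonical unit, scale factor).
UNIT_TO_CANONICAL = {
    "m": ("m", 1.0),
    "cm": ("m", 0.01),
    "s": ("s", 1.0),
    "ms": ("s", 1e-3),
}

def grade_numeric(pred_value: float, pred_unit: str,
                  gold_value: float, gold_unit: str,
                  abs_tol: float = 1e-2, rel_tol: float = 0.05) -> bool:
    """Return True iff the prediction matches the gold answer.

    Both values are converted to a canonical unit first; the unit check
    fails if the two units map to different canonical dimensions.
    """
    if pred_unit not in UNIT_TO_CANONICAL or gold_unit not in UNIT_TO_CANONICAL:
        return False
    p_canon, p_factor = UNIT_TO_CANONICAL[pred_unit]
    g_canon, g_factor = UNIT_TO_CANONICAL[gold_unit]
    if p_canon != g_canon:  # unit check: incompatible dimensions
        return False
    pred = pred_value * p_factor
    gold = gold_value * g_factor
    # math.isclose passes when |pred - gold| <= max(rel_tol * max(|pred|, |gold|), abs_tol).
    return math.isclose(pred, gold, rel_tol=rel_tol, abs_tol=abs_tol)

# e.g. grade_numeric(98.0, "cm", 1.0, "m") -> True within the 5% relative tolerance
```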
Submission Number: 21