CausalPhysics: Unifying Semantic Reasoning, Physical Dynamics, and Counterfactual Simulation in World Models

Published: 02 Mar 2026 · Last Modified: 05 Mar 2026 · ICLR 2026 Workshop World Models · CC BY 4.0
Keywords: world models, causal reasoning, physical AI, counterfactual simulation, video understanding, physics-informed learning, graph neural networks, semantic grounding
TL;DR: A unified framework combining semantic understanding, causal graph learning, and physics constraints that achieves 46.8% on Physics-IQ benchmark (+95% over Sora) and 71.3% causal consistency (+226% over GPT-4V).
Abstract: Current world models fragment physical intelligence into separate pipelines. Vision-language models (VLMs) excel at semantic tasks but struggle with causal physical reasoning: on our CAUSALPHYSICS-BENCH evaluation, GPT-4V answers only 21.9% of counterfactual physics queries correctly. Video generators produce realistic frames but understand little physics: Sora attains 24.1%, Runway Gen-3 23.2%, and VideoPoet 21.4% on Physics-IQ (Motamed et al., 2025). Model-based reinforcement learning (MBRL) systems operate in narrow domains and lack semantic grounding. We present CAUSALPHYSICS, a single architecture that bridges these gaps with three tightly coupled modules: (1) a Semantic-Physical Encoder (SPE) that fuses DINOv2 vision tokens with frozen LLaMA-2 language representations through cross-attention; (2) a Causal Graph Induction Module (CGIM) that discovers a differentiable structural causal model from video, supporting Pearl's do-operator and counterfactual queries; (3) a Physics-Constrained Dynamics Network (PCDN) that propagates states through the learned causal graph while enforcing differentiable conservation-law constraints. On the official Physics-IQ v1.0 toolkit, CAUSALPHYSICS scores 46.8 ± 0.9, a 47% relative gain over V-JEPA 2 (31.8 ± 1.4) and roughly double Sora (24.1). Causal consistency reaches 71.3 ± 1.2% on CAUSALPHYSICS-BENCH versus 21.9 ± 0.8% for GPT-4V ($p<0.001$, paired t-test, 3 seeds). Out-of-distribution (OOD) generalization improves by 20.2 percentage points over the strongest baseline.
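To make the counterfactual semantics the abstract invokes concrete, the sketch below walks through Pearl's do-operator on a toy structural causal model. This is not the paper's CGIM implementation (which learns a differentiable SCM from video); the linear mechanisms and noise values here are hypothetical, chosen only to illustrate the abduction-then-intervention pattern behind a counterfactual query.

```python
# Toy SCM with two mechanisms and exogenous noise:
#   X := U_x
#   Y := 2*X + U_y
# A do(X = x) intervention severs X's mechanism and fixes its value,
# leaving Y's mechanism intact.

def scm(u_x, u_y, do_x=None):
    """Evaluate the toy SCM; `do_x` (if given) applies do(X = do_x)."""
    x = u_x if do_x is None else do_x   # intervention replaces X's equation
    y = 2.0 * x + u_y                   # Y's mechanism is unchanged
    return x, y

# Factual world: abduction fixes the exogenous noise from observation.
u_x, u_y = 1.0, 0.5
x_obs, y_obs = scm(u_x, u_y)            # -> (1.0, 2.5)

# Counterfactual query: "what would Y have been had X been 3?"
# Reuse the SAME noise terms, but intervene with do(X = 3).
x_cf, y_cf = scm(u_x, u_y, do_x=3.0)    # -> (3.0, 6.5)
```

The key distinction this illustrates is that a counterfactual is not a fresh sample: the exogenous noise is held at its abduced values while only the intervened mechanism changes, which is exactly the query type CAUSALPHYSICS-BENCH evaluates.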
Submission Number: 39