Vision Language Models Cannot Reason About Physical Transformation

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: vision language models, physical transformation, multi-image understanding, spurious correlation
Abstract: Understanding physical transformations is fundamental for reasoning in dynamic, real-world environments. While Vision Language Models (VLMs) show promise in embodied applications grounded in the physical world, whether they genuinely understand physical transformations remains unclear. To address this gap, we introduce \textit{ConservationBench} to evaluate \textit{conservation}: whether physical quantities remain invariant under transformations despite appearance changes. The benchmark spans four quantitative properties (number, length, volume, size); each task requires integrating visual evidence across time and includes counterfactuals in which the targeted quantity is not conserved, forming paired conserving and non-conserving scenarios. By systematically varying prompts, frame-sampling methods, and task design, we generate 13,824 questions and evaluate 34 VLMs. Results reveal consistent failure: no model demonstrates systematic conservation. Performance remains only marginally above chance, and improvements on conservation tasks are often accompanied by severe degradation on counterfactual controls. This suggests a reliance on superficial patterns or shortcuts rather than genuine understanding of and reasoning about conservation. Moreover, models show no benefit from higher temporal resolution or prompt design. Together, these findings indicate that current VLMs fail to reason about physical transformation.
Primary Area: datasets and benchmarks
Submission Number: 23583