Keywords: large multimodal models, physical reasoning, inductive reasoning
TL;DR: We evaluate whether large multimodal models can infer and apply unseen physical laws from demonstration samples.
Abstract: Large multimodal models (LMMs) encode universal physical laws observed during training, such as momentum conservation, as parametric knowledge. This parametric knowledge allows LMMs to answer physical reasoning queries, such as predicting the outcome of a potential collision event from visual input. However, because parametric knowledge covers only the physical laws seen during training, it is insufficient for reasoning when the inference scenario follows physical laws unseen during training. In contrast, humans can adapt their physical reasoning to unseen physical environments from only a few visual examples. This inductive physical reasoning ability is indispensable for LMMs if they are to replace human agents in safety-critical applications. Despite its importance, existing visual benchmarks evaluate only the parametric knowledge in LMMs, not inductive physical reasoning. To address this gap, we propose InPhyRe, the first visual question answering benchmark to measure inductive physical reasoning in LMMs. InPhyRe evaluates LMMs on their ability to predict the outcomes of collision events in algorithmically generated synthetic videos. Evaluating over 13 open-source and proprietary LMMs, InPhyRe reveals that (1) LMMs struggle to apply their limited parametric knowledge of universal physical laws to reasoning, (2) inductive physical reasoning in LMMs is weak when inference scenarios obey physical laws unseen during training, and (3) inductive physical reasoning in LMMs suffers from language bias and largely ignores the visual inputs, calling into question the trustworthiness of LMMs with respect to visual inputs.
Primary Area: datasets and benchmarks
Submission Number: 14468