Benchmarking Progress to Infant-Level Physical Reasoning in AI
Abstract: To what extent do modern AI systems comprehend the physical world? We introduce the open-access Infant-Level Physical Reasoning Benchmark (InfLevel) to gain insight into this question. We evaluate ten neural-network architectures developed for video understanding on tasks designed to test these models' ability to reason about three essential physical principles which researchers have shown to guide human infants' physical understanding. We explore the sensitivity of each AI system to the continuity of objects as they travel through space and time, to the solidity of objects, and to gravity. We find strikingly consistent results across 60 experiments with multiple systems, training regimes, and evaluation metrics: current popular visual-understanding systems are at or near chance on all three principles of physical reasoning. We close by suggesting some potential ways forward.
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:

### Additional context regarding metrics

We have clarified our position regarding our metrics in Section 3.4 where we now say:

> For this reason, InfLevel is an evaluation-only benchmark: no training on InfLevel is allowed. This has an unfortunate implication we must overcome: models that wish to report scores on InfLevel must be able to provide a scalar “surprise” score for every input video. This requirement, used by other benchmarks (Riochet et al., 2018), is limiting as it requires that anyone wishing to evaluate on InfLevel to train a special “surprise” decoder on their model using some external data source. To circumvent this problem, we model surprise as out-of-domain (OOD) detection using the intuition that models with sufficient physical understanding should consider physically implausible events more out-of-domain than physically plausible ones. In Sec. 4, we propose several OOD metrics, each of which takes a representation of a video and returns a scalar quantifying how out-of-domain the video is. While we show, in Sec. 4, that this OOD approach is empirically promising, it is easy to show that there is no surprise metric which can be used to detect physical understanding for all possible models (see App. E.2). Given this, anyone evaluating on InfLevel is free to define their own surprise metrics so long as: (1) the same metric is used across all subsets of InfLevel (i.e. there should not be one metric for Continuity and another for Gravity) and (2) these surprise metrics are neither trained on InfLevel nor designed explicitly to exploit regularities in InfLevel data (which would be an implicit form of training). Going forward, we hope that researchers will continue to improve and refine the model-agnostic OOD surprise measures we propose.

### Typos and minor changes

We have fixed some typos and added one additional concurrent work to our related work section.
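To make the OOD framing of surprise in the Section 3.4 text concrete, here is a minimal sketch of one possible surprise metric and a way to score it. Everything in it is an illustrative assumption, not the paper's method: the distance-to-mean metric, the function names `make_distance_surprise` and `paired_accuracy`, the toy 2-D embeddings, and the pairing of each plausible video with an implausible counterpart are all hypothetical. The OOD metrics actually proposed in Sec. 4 are not reproduced here.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def make_distance_surprise(reference: List[Vector]) -> Callable[[Vector], float]:
    """Build a hypothetical surprise metric: Euclidean distance to the mean
    of reference embeddings. The reference set must come from external data,
    not from InfLevel, since training on the benchmark is not allowed."""
    center = mean_vector(reference)

    def surprise(embedding: Vector) -> float:
        # Larger distance means the video is treated as more out-of-domain.
        return math.dist(embedding, center)

    return surprise


def paired_accuracy(
    surprise: Callable[[Vector], float],
    pairs: List[Tuple[Vector, Vector]],
) -> float:
    """Fraction of (plausible, implausible) pairs in which the implausible
    video gets the strictly higher surprise score. Under this assumed
    pairing, 0.5 corresponds to chance."""
    correct = sum(
        1 for plausible, implausible in pairs
        if surprise(implausible) > surprise(plausible)
    )
    return correct / len(pairs)


# Toy embeddings standing in for video representations (made-up values).
reference = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
surprise = make_distance_surprise(reference)

pairs = [
    ([0.5, 0.6], [3.0, 3.0]),  # implausible video is farther from the mean
    ([0.4, 0.5], [0.5, 0.5]),  # implausible video sits exactly at the mean
]
accuracy = paired_accuracy(surprise, pairs)
print(f"paired accuracy: {accuracy:.2f}")
```

Following the two conditions quoted above, a metric like this would have to be applied unchanged to the Continuity, Solidity, and Gravity subsets, and its reference embeddings would have to come from data other than InfLevel.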
Assigned Action Editor: ~Josh_Merel1
Submission Number: 301