Where is the Invisible: Spatial-Temporal Reasoning with Object Permanence

24 Sept 2023 (modified: 05 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Object Permanence, Visual Relational Reasoning, Trajectory Prediction
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a Qualitative-Quantitative Spatial-Temporal Reasoning framework (QQ-STR) for invisible object tracking, and demonstrate the effectiveness and robustness of the proposed method on both synthetic and real-world datasets.
Abstract: Object permanence is a cognitive ability that enables humans to reason about the existence and location of objects that are not visible in the scene, such as those occluded or contained by other objects. This ability is crucial for visual object tracking, which aims to identify and localize the target object across video frames. However, most existing tracking methods rely on deep learning models that learn discriminative visual features from the visual context and fail to handle cases where the object disappears from the image, e.g., when it is occluded or contained by other objects. In this paper, we propose a novel framework for tracking invisible objects based on Qualitative-Quantitative Spatial-Temporal Reasoning (QQ-STR), inspired by the concept of object permanence. Our framework consists of three modules: a visual perception module, a qualitative spatial relation reasoner (SRR), and a quantitative, relation-conditioned spatial-temporal relation analyst (SRA). The SRR module infers the qualitative relationship between each object and the target based on current and historical observations, while the SRA module predicts the quantitative location of the target based on the inferred relationship and a diffusion model that captures the object's motion. We devise a self-supervised learning mechanism that requires no explicit relation annotations and leverages the predicted trajectories to locate the invisible object in videos. We evaluate our framework on a synthetic dataset (LA-CATER) and a new real-world RGB-D video dataset for invisible object tracking (iVOT) that contains challenging scenarios of human-object interactions with frequent occlusion and containment events. Our framework achieves performance comparable to that of state-of-the-art tracking methods that use additional relation annotations, demonstrating its ability to generalize to novel scenes and viewpoints.
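To make the pipeline described in the abstract concrete, the sketch below illustrates one plausible data flow between the SRR and SRA modules. All class names, tensor shapes, the three-way relation vocabulary, and the GRU-based motion predictor are illustrative assumptions introduced here (the paper itself uses a diffusion model for motion and its own architectures); this is not the authors' implementation.

```python
# Hypothetical sketch of the QQ-STR data flow: perception features -> qualitative
# relations (SRR) -> relation-conditioned location prediction (SRA).
import torch
import torch.nn as nn

RELATIONS = ["visible", "occluded_by", "contained_by"]  # assumed relation vocabulary


class SpatialRelationReasoner(nn.Module):
    """Qualitative SRR: scores the target's relation to each scene object."""

    def __init__(self, feat_dim: int = 128, num_relations: int = len(RELATIONS)):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, num_relations),
        )

    def forward(self, target_feat: torch.Tensor, object_feats: torch.Tensor) -> torch.Tensor:
        # target_feat: (feat_dim,), object_feats: (num_objects, feat_dim)
        pairs = torch.cat([target_feat.expand_as(object_feats), object_feats], dim=-1)
        return self.mlp(pairs).softmax(dim=-1)  # (num_objects, num_relations)


class SpatialTemporalRelationAnalyst(nn.Module):
    """Quantitative SRA: predicts the target's box conditioned on relation history.
    (A GRU stands in here for the paper's diffusion-based motion model.)"""

    def __init__(self, feat_dim: int = 128, num_relations: int = len(RELATIONS)):
        super().__init__()
        self.encoder = nn.GRU(4 + num_relations, feat_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, 4)  # predicted box (x, y, w, h)

    def forward(self, past_boxes: torch.Tensor, relation_probs: torch.Tensor) -> torch.Tensor:
        # past_boxes: (T, 4); relation_probs: (T, num_relations)
        inp = torch.cat([past_boxes, relation_probs], dim=-1).unsqueeze(0)
        _, h = self.encoder(inp)
        return self.head(h[-1]).squeeze(0)  # next-step box estimate


# Toy usage with random perception features and box history.
srr = SpatialRelationReasoner()
sra = SpatialTemporalRelationAnalyst()
target_feat = torch.randn(128)
object_feats = torch.randn(5, 128)                       # 5 scene objects
relations = srr(target_feat, object_feats)               # qualitative relations per object
past_boxes = torch.randn(8, 4)                           # 8 frames of target boxes
relation_hist = torch.randn(8, len(RELATIONS)).softmax(dim=-1)
print(sra(past_boxes, relation_hist))                    # location of the (possibly invisible) target
```

In this reading, the SRR output tells the SRA which object the target is occluded or contained by, so the target's box can be propagated from that container's motion even when the target itself is not visible.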
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9071