On the Benefits of Instance Decomposition in Video Prediction Models

TMLR Paper5296 Authors

04 Jul 2025 (modified: 09 Sept 2025) | Under review for TMLR | CC BY 4.0
Abstract: Video prediction is a crucial task for intelligent agents such as robots and autonomous vehicles, since it enables them to anticipate and act early on time-critical incidents. State-of-the-art video prediction methods typically model the dynamics of a scene jointly and implicitly, without any explicit decomposition into separate objects. This is challenging and potentially sub-optimal, as every object in a dynamic scene has its own pattern of movement, typically somewhat independent of the others. In this paper, we investigate the benefit of explicitly modeling the objects in a dynamic scene separately within the context of latent-transformer video prediction models. We conduct detailed and carefully controlled experiments on both synthetic and real-world datasets; our results show that decomposing a dynamic scene leads to higher-quality predictions compared with models of a similar capacity that lack such decomposition.
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=lyqhffQbS7
Changes Since Last Submission:
- The intuition and aim of the paper are made clearer in the introduction; see pages 1 and 2.
- Additional description of the SNCAT variant of our model is added to the captions of Figure 1 and Figure 2.
- Definitions of the evaluation metrics are added; see page 8.
- We clearly state that our paper does not address the problem of background motion; see page 8.
- A FLOP analysis is added for each model variant and compared with the baselines.
- The model's robustness is tested by simulating segmentation errors via dilation and erosion; see pages 11, 12, and 14.
- The best, worst, and average cases of our sampling strategy, together with the relevant analysis, are added; see pages 9, 10, and 13.
- A limitation section is added; see page 15.
- Further details on each of these changes are given in the individual reviewer responses below.
Assigned Action Editor: ~Masha_Itkina1
Submission Number: 5296