Abstract: In recent years, many models for predicting real-world events have been proposed. Most research on recognizing and predicting the real world derives its predictions either from visual changes, such as changes in pixels, or from numerical changes in physical simulators; few models can predict on the basis of both visual and physical characteristics, as humans can. In this study, we therefore constructed a new prediction model that uses both the visual information and the physical characteristics of the environment by integrating into PreCNet the mechanism of variational temporal abstraction, which extracts change points in the observed environment from visual information. Furthermore, to make the prediction results interpretable, we generated the inferred prediction content as a sentence. In addition, we verified whether the generated sentences could explain collision situations in as much detail as a human can when given physical common sense about the environment, such as the movement and mass of objects.