Abstract: This paper addresses the scene change captioning task, which describes scene changes in natural language for real-world scenarios. Most current three-dimensional understanding tasks focus on recognizing static scenes; despite its importance in a variety of real-environment applications, scene change understanding remains underexplored. Existing change understanding methods in robotics focus on change detection and lack the ability to recognize scene changes in detail. Moreover, most previous change captioning experiments were conducted on simulation datasets with limited visual complexity, which limits their applicability to real scenarios. To address these issues, we propose a scene change captioning dataset whose scenes are photographed with RGB-D cameras. We also propose an automatic simulation dataset generation process aimed at training models that transfer to real scenarios. We conducted experiments with various input modalities and propose a method that integrates them using an attention mechanism over modalities and dynamic attention to select relevant information during sentence generation. The experimental results show that models trained on the proposed simulation dataset achieve promising results on the real-scenario dataset, indicating the practicality of the proposed dataset generation process in real scenarios. The proposed multimodality integration method generates change captions with high change-type and object-attribute accuracy while remaining robust in real scenarios. We hope our work opens a door for future research on scene change understanding in real scenarios.