Abstract: Image captioning is a challenging scene-understanding task that connects computer vision and natural language processing. While image captioning models have been successful at producing excellent descriptions, the field has largely focused on generating a single sentence for 2D images. This paper investigates whether integrating depth information with RGB images can enhance the captioning task and yield better descriptions. To this end, we propose a Transformer-based encoder-decoder framework for generating multi-sentence descriptions of a 3D scene. The RGB image and its corresponding depth map are provided as inputs to our framework, which fuses them to build a richer understanding of the input scene. Depth maps can be either ground truth or estimated, which makes our framework widely applicable to any RGB captioning dataset. We explore different approaches to fusing the RGB and depth images. Experiments are performed on the NYU-v2 dataset and the Stanford image paragraph captioning dataset. While working with the NYU-v2 dataset, we found inconsistent labeling that prevented depth information from benefiting the captioning task; the results were even worse than using RGB images alone. We therefore propose a cleaned version of the NYU-v2 dataset that is more consistent and informative. Our results on both datasets demonstrate that the proposed framework effectively benefits from depth information, whether ground truth or estimated, and generates better captions. Code, pre-trained models, and the cleaned version of the NYU-v2 dataset will be made publicly available.
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We thank all the reviewers for their insightful reviews and suggestions. We have revised the paper based on your comments and look forward to your feedback. A summary of the major changes follows:
- We improved the discussion about the importance of the task in the introduction section.
- We added the Stanford image paragraph captioning dataset to the datasets section.
- We added experiments on the Stanford image paragraph captioning dataset with predicted depth maps (Table 8).
- We added experiments on the NYU-v2 dataset with predicted depth maps (Table 9).
- We added a discussion of the implications of obtaining improved performance even from predicted depth maps.
- We added more details about the relabelling process in the NYU-v2 dataset cleaning section.
- We added the CIDEr metric to the evaluation metrics, and CIDEr results to Tables 2, 8, and 9.
Assigned Action Editor: ~Antoni_B._Chan1
Submission Number: 712