Egocentric Vehicle Dense Video Captioning

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM 2024 Oral, CC BY 4.0
Abstract: Typical dense video captioning concentrates mostly on third-person videos, which are generally characterized by relatively well-delineated events, as in edited instructional videos. However, such videos do not genuinely reflect the way we perceive our real lives. Instead, we observe the world from an egocentric viewpoint and witness only continuous, unedited footage. To facilitate further research, we introduce a new task, Egocentric Vehicle Dense Video Captioning, in the classic first-person driving scenario. This is a multi-modal, multi-task problem aimed at comprehensive understanding of untrimmed, egocentric driving videos. It consists of three sub-tasks focusing on event localization, event captioning, and vehicle state estimation, respectively. Accomplishing these tasks requires addressing at least three challenges: extracting relevant ego-motion information, describing driving behavior together with its underlying rationale, and resolving the boundary ambiguity problem. In response, we devise corresponding solutions, encompassing a vehicle ego-motion learning strategy and a novel adjacent contrastive learning strategy, which address these issues to a considerable extent. We validate our method with extensive experiments on the BDD-X dataset; the results are promising and set new state-of-the-art performance on most metrics, demonstrating the effectiveness of our approach.
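To make the idea of "adjacent contrastive learning" for boundary ambiguity more concrete, the sketch below shows one plausible InfoNCE-style formulation: clip features belonging to the same event are pulled together while clips from neighbouring events are pushed apart. This is only an illustrative assumption, not the paper's actual formulation; all names (adjacent_contrastive_loss, clip_feats, event_ids, temperature) are hypothetical.

```python
import torch
import torch.nn.functional as F

def adjacent_contrastive_loss(clip_feats: torch.Tensor,
                              event_ids: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """clip_feats: (N, D) per-clip features from one untrimmed video.
    event_ids: (N,) integer index of the event each clip belongs to
    (assumed available during training)."""
    z = F.normalize(clip_feats, dim=-1)           # unit-normalize clip embeddings
    sim = z @ z.t() / temperature                 # (N, N) scaled cosine similarities
    same_event = event_ids.unsqueeze(0) == event_ids.unsqueeze(1)
    adjacent = (event_ids.unsqueeze(0) - event_ids.unsqueeze(1)).abs() == 1
    eye = torch.eye(len(event_ids), dtype=torch.bool, device=clip_feats.device)

    losses = []
    for i in range(len(event_ids)):
        pos = same_event[i] & ~eye[i]             # positives: other clips of the same event
        neg = adjacent[i]                         # negatives: clips of the neighbouring events
        if pos.any() and neg.any():
            logits = torch.cat([sim[i][pos], sim[i][neg]])
            log_prob = F.log_softmax(logits, dim=0)
            k = int(pos.sum())
            losses.append(-log_prob[:k].mean())   # InfoNCE-style term averaged over positives
    return torch.stack(losses).mean() if losses else clip_feats.new_zeros(())
```

Under this assumption, such a term would be added to the captioning and localization losses so that event representations become more separable near their boundaries.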
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: We propose a new task based on traditional multimodal dense video captioning, termed Egocentric Vehicle Dense Video Captioning, aimed at untrimmed video understanding in first-person driving scenarios; we are the first to incorporate an ego-motion information learning approach into dense video captioning (DVC); in addition, we design a trimodal contrastive learning strategy for event feature learning; we conduct extensive experiments on the BDD-X dataset, achieving state-of-the-art results on most metrics and demonstrating the effectiveness of our approach.
Supplementary Material: zip
Submission Number: 3041