Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments
Abstract: Visual understanding goes well beyond the study of images or videos on the web. To achieve complex tasks in volatile situations, humans can deeply understand the environment, quickly perceive events happening around them, and continuously track objects' state changes, all of which remain challenging for current AI systems. To equip AI systems with the ability to understand dynamic ENVironments, we build a video Question Answering dataset named Env-QA. Env-QA contains 23K egocentric videos, where each video is composed of a series of events about exploring and interacting in the environment. It also provides 85K questions to evaluate the ability to understand the composition, layout, and state changes of the environment presented by the events in the videos. Moreover, we propose a video QA model, the Temporal Segmentation and Event Attention network (TSEA), which introduces an event-level video representation and corresponding attention mechanisms to better extract environment information and answer questions. Comprehensive experiments demonstrate the effectiveness of our framework and show the formidable challenges Env-QA poses in terms of long-term state tracking, multi-event temporal reasoning, and event counting.
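To make the "event-level video representation and corresponding attention mechanisms" concrete, below is a minimal sketch of question-guided attention over event features. The abstract does not specify TSEA's architecture, so all names, shapes, and the equal-chunk segmentation stand-in here are illustrative assumptions, not the authors' actual method (TSEA's temporal segmentation predicts event boundaries rather than splitting frames evenly).

```python
import torch
import torch.nn as nn


def segment_events(frame_feats: torch.Tensor, num_events: int) -> torch.Tensor:
    # Stand-in temporal segmentation: mean-pool frame features into
    # equal-length chunks, yielding one vector per "event".
    # frame_feats: (batch, num_frames, feat_dim)
    chunks = frame_feats.chunk(num_events, dim=1)
    return torch.stack([c.mean(dim=1) for c in chunks], dim=1)


class EventAttention(nn.Module):
    """Question-guided attention pooling over event-level features
    (an illustrative sketch, not the published TSEA module)."""

    def __init__(self, feat_dim: int, q_dim: int, hidden: int = 512):
        super().__init__()
        self.proj_e = nn.Linear(feat_dim, hidden)
        self.proj_q = nn.Linear(q_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, events: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # events:   (batch, num_events, feat_dim) -- one vector per event
        # question: (batch, q_dim)                -- encoded question
        h = torch.tanh(self.proj_e(events) + self.proj_q(question).unsqueeze(1))
        alpha = torch.softmax(self.score(h), dim=1)  # weights over events
        return (alpha * events).sum(dim=1)           # question-aware video summary


# Hypothetical usage with made-up feature sizes:
frames = torch.randn(2, 60, 1024)                 # 60 per-frame features
question = torch.randn(2, 300)                    # question embedding
events = segment_events(frames, num_events=6)     # (2, 6, 1024)
summary = EventAttention(1024, 300)(events, question)  # (2, 1024)
```

The point of the event-level representation is that attention operates over a handful of semantically meaningful segments instead of hundreds of raw frames, which is what makes multi-event reasoning such as state tracking and event counting tractable.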