Open-Vocabulary Video Scene Graph Generation via Union-aware Semantic Alignment

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 | MM 2024 Poster | License: CC BY 4.0
Abstract: Video Scene Graph Generation (VidSGG) plays a crucial role in various vision-language tasks by providing accessible, structured visual relation knowledge. However, prevailing VidSGG methods require annotations for all categories, which limits their application in real-world scenarios. Although popular vision-language models (VLMs) have enabled preliminary exploration of the open-vocabulary VidSGG task, the correspondence between visual union regions and relation predicates is usually ignored. We therefore propose an open-vocabulary VidSGG framework named Union-Aware Semantic Alignment Network (UASAN), which explores the alignment between visual union regions and relation predicate concepts in a shared semantic space. Specifically, a visual refiner is designed to acquire open-vocabulary knowledge and the ability to bridge different modalities. To achieve better alignment, we first design a semantic-aware context encoder that performs comprehensive semantic interaction among object trajectories, visual union regions, and trajectory motion information to obtain semantic-aware union region representations. A union-relation alignment decoder then generates a discriminative relation token for each union region for final relation prediction. Extensive experiments on two benchmark datasets show that UASAN significantly outperforms existing methods, which also verifies the necessity of modeling union region-predicate alignment in the VidSGG pipeline. Code is available in the supplementary material.
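For readers who want a structural picture of the pipeline sketched in the abstract, below is a minimal PyTorch sketch. All module names, tensor shapes, and dimensions (visual_refiner, context_encoder, alignment_decoder, dim=512, and so on) are hypothetical placeholders chosen for illustration; they do not reproduce the released implementation in the supplementary material.

    # Hypothetical sketch of the UASAN pipeline described in the abstract;
    # module names and dimensions are placeholders, not the released code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class UASANSketch(nn.Module):
        def __init__(self, dim=512, num_heads=8):
            super().__init__()
            # Visual refiner: projects union-region features toward the
            # VLM semantic space to acquire open-vocabulary knowledge.
            self.visual_refiner = nn.Sequential(
                nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            # Semantic-aware context encoder: lets union regions interact with
            # object trajectories and trajectory motion information.
            enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                                   batch_first=True)
            self.context_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
            # Union-relation alignment decoder: produces one discriminative
            # relation token per union region.
            dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads,
                                                   batch_first=True)
            self.alignment_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)

        def forward(self, union_feats, traj_feats, motion_feats, predicate_text_embeds):
            # union_feats:  (B, U, D) visual features of subject-object union regions
            # traj_feats:   (B, T, D) object trajectory features
            # motion_feats: (B, T, D) trajectory motion features
            # predicate_text_embeds: (P, D) text embeddings of predicate concepts
            refined = self.visual_refiner(union_feats)
            context = torch.cat([refined, traj_feats, motion_feats], dim=1)
            context = self.context_encoder(context)
            relation_tokens = self.alignment_decoder(refined, context)   # (B, U, D)
            # Open-vocabulary prediction: cosine similarity between relation
            # tokens and predicate concept embeddings in the shared space.
            logits = F.normalize(relation_tokens, dim=-1) @ \
                     F.normalize(predicate_text_embeds, dim=-1).T
            return logits                                                # (B, U, P)

    # Example usage with random tensors (2 videos, 6 union regions,
    # 4 trajectories, 50 candidate predicate concepts):
    model = UASANSketch()
    scores = model(torch.randn(2, 6, 512), torch.randn(2, 4, 512),
                   torch.randn(2, 4, 512), torch.randn(50, 512))  # -> (2, 6, 50)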
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: The Video Scene Graph Generation (VidSGG) task aims to predict the visual relationships between different visual entity trajectories in a given video, expressing the relationships as textual relation triplets of the form <subject-predicate-object>. It plays a crucial role in various vision-language tasks, such as visual question answering, video retrieval, and video captioning, by furnishing structured knowledge that enhances video understanding. However, existing VidSGG methods remain constrained to recognizing objects and predicting visual relations within closed-set scenarios. We therefore propose a novel open-vocabulary VidSGG (Ov-VidSGG) framework named Union-Aware Semantic Alignment Network, which explicitly models the alignment between visual union regions and relation predicates for comprehensive and robust relation prediction. Our method is built on cross-modal interaction between the visual and textual modalities and aims to obtain structured textual relation triplets from videos (see the illustrative output sketch below).
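As a concrete illustration of the output structure described above, a VidSGG system emits a list of relation triplets for each video; the trajectory identifiers and category names below are invented purely for illustration.

    # Illustrative only: the <subject-predicate-object> triplets a VidSGG
    # model might output for a short clip (names and ids are made up).
    video_scene_graph = [
        {"subject": "person_0", "predicate": "ride", "object": "bicycle_1"},
        {"subject": "person_0", "predicate": "next_to", "object": "dog_2"},
    ]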
Supplementary Material: zip
Submission Number: 2357