Selective arguments representation with dual relation-aware network for video situation recognition

Published: 01 Jan 2024, Last Modified: 20 May 2025 · Neural Comput. Appl. 2024 · CC BY-SA 4.0
Abstract: Argument visual states are helpful for detecting the structured components of events in videos, and existing methods typically use object detectors to generate candidate arguments. However, directly leveraging object features captured by bounding boxes overlooks both the relations among objects and the gap between detected objects and real arguments. In this work, we propose a novel framework that generates selective contextual representations of videos, thereby reducing the interference of useless or incorrect object features. First, we construct grid-based object features as graphs based on internal grid connections and then use a graph convolutional network to aggregate features. Second, we design a weighted geometric attention module that obtains contextual representations of objects by explicitly combining visual similarity and geometric correlation with different importance proportions. We then propose a dual relation-aware selection module for further feature selection. Finally, we use labels as a ladder to bridge the gap between object features and semantic roles, while accounting for proximity in the semantic space. Experimental results and extensive ablation studies on the VidSitu benchmark indicate that our method effectively obtains a deep understanding of events in videos and outperforms state-of-the-art models.
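To make the weighted geometric attention concrete, below is a minimal sketch of one plausible form of such a module. It is not the authors' implementation: the class name (WeightedGeometricAttention), the learnable mixing weight (alpha), and the 4-d log-scaled box-offset encoding are assumptions, following common relation-network practice for combining visual similarity with geometric correlation under learned importance proportions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightedGeometricAttention(nn.Module):
    """Hypothetical sketch: mix a visual-similarity term and a geometric-
    correlation term with a learned proportion before softmax aggregation."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Geometric correlation scored from pairwise box offsets
        # (assumed 4-d encoding: dx, dy, log dw, log dh).
        self.geo = nn.Sequential(nn.Linear(4, dim), nn.ReLU(), nn.Linear(dim, 1))
        # Learnable importance proportion between the two terms.
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, feats: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim) object features; boxes: (N, 4) as (cx, cy, w, h).
        q, k, v = self.q(feats), self.k(feats), self.v(feats)
        # Scaled dot-product visual similarity, shape (N, N).
        visual = (q @ k.t()) / (feats.size(-1) ** 0.5)
        # Pairwise geometric offsets, normalized and log-scaled.
        cxy, wh = boxes[:, :2], boxes[:, 2:].clamp(min=1e-3)
        d_xy = (cxy[:, None, :] - cxy[None, :, :]) / wh[None, :, :]  # (N, N, 2)
        d_wh = torch.log(wh[:, None, :] / wh[None, :, :])            # (N, N, 2)
        geo = self.geo(torch.cat([d_xy, d_wh], dim=-1)).squeeze(-1)  # (N, N)
        # Mix the two terms; sigmoid keeps the proportion in (0, 1).
        w = torch.sigmoid(self.alpha)
        attn = F.softmax(w * visual + (1.0 - w) * geo, dim=-1)
        return attn @ v  # (N, dim) contextual object representations


# Usage: 5 detected objects with 256-d features and positive box sizes.
module = WeightedGeometricAttention(256)
feats = torch.randn(5, 256)
boxes = torch.rand(5, 4) + 0.1
contextual = module(feats, boxes)  # (5, 256)
```

Making the proportion learnable (rather than fixing it at 0.5) lets the model decide per task how much geometric layout should influence the contextual representation relative to appearance similarity.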