Abstract: Various video understanding tasks have been extensively explored in the multimedia community, among which video scene graph generation (VidSGG) is particularly challenging, since it requires identifying objects in complex scenes and deducing their relationships. Existing methods for this task generally aggregate object-level visual information from both spatial and temporal perspectives to learn more powerful relationship representations. However, these leading techniques model the spatial-temporal context only implicitly, which may lead to ambiguous predicate predictions when visual relations vary frequently. In this work, I propose incorporating spatial-temporal knowledge into relation representation learning to effectively constrain the predicate prediction space within each image and the sequential variation across temporal frames. To this end, I design a novel spatial-temporal knowledge-embedded transformer (STKET) that incorporates prior spatial-temporal knowledge into the multi-head cross-attention mechanism to learn more representative relationship representations. Extensive experiments conducted on Action Genome demonstrate the effectiveness of the proposed STKET.
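To make the central idea concrete, the sketch below illustrates one way prior knowledge can be embedded into multi-head cross-attention: relationship queries attend over object-level visual features whose keys are conditioned on a learned knowledge embedding. This is a minimal, hypothetical PyTorch illustration of the general mechanism, not the authors' released STKET implementation; the class name `KnowledgeEmbeddedCrossAttention`, the embedding-table form of the prior, and all dimensions are assumptions made for exposition.

```python
# Hypothetical sketch of knowledge-embedded cross-attention (not the official STKET code).
import torch
import torch.nn as nn


class KnowledgeEmbeddedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, num_predicates: int):
        super().__init__()
        # Learned table standing in for prior spatial-temporal knowledge
        # (e.g., predicate co-occurrence or transition statistics).
        self.knowledge = nn.Embedding(num_predicates, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rel_queries, visual_feats, prior_ids):
        # rel_queries:  (B, Q, dim) relationship representations
        # visual_feats: (B, N, dim) object-level visual features
        # prior_ids:    (B, N)      indices into the knowledge table
        k = visual_feats + self.knowledge(prior_ids)  # knowledge-conditioned keys
        out, _ = self.attn(query=rel_queries, key=k, value=visual_feats)
        return self.norm(rel_queries + out)           # residual connection + norm


# Toy usage with made-up sizes
layer = KnowledgeEmbeddedCrossAttention(dim=256, num_heads=8, num_predicates=26)
queries = torch.randn(2, 10, 256)
feats = torch.randn(2, 36, 256)
ids = torch.randint(0, 26, (2, 36))
print(layer(queries, feats, ids).shape)  # torch.Size([2, 10, 256])
```

In this reading, adding the knowledge embedding to the attention keys biases which visual features each relationship query attends to, which is one plausible way a prior could constrain the predicate prediction space as described above.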