Temporally Aligned Relation Modeling for Panoptic Video Scene Graph Generation

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Panoptic Video Scene Graph Generation; Scene Understanding; Activity Analysis
Abstract: Panoptic Video Scene Graph Generation (PVSG) aims to achieve comprehensive video understanding by segmenting entities and predicting their temporal relations. These relations vary in duration and evolve dynamically over time. However, existing methods model relations over the entire video sequence, which misaligns the perception scope with the actual interaction intervals and often introduces irrelevant context. To address this, we propose TempFocusNet (TFNet), a new framework that first localizes the intervals where relations occur and then performs focused context modeling within them, enabling temporally aligned and more accurate relation prediction. Specifically, we extract visual and category semantic features for each entity to construct temporally continuous entity feature tubes. Multiple temporal queries then interact with paired entity tubes to capture diverse temporal cues and generate candidate relation intervals, which are represented as Gaussian masks to model their temporal structure. Finally, the Gaussian masks guide temporal focus attention toward the relevant intervals for relation classification. Extensive experiments show that TFNet achieves state-of-the-art performance on the OpenPVSG and ImageNet-VidVRD datasets. The code for TFNet will be made available.
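To make the abstract's core mechanism concrete, the sketch below shows one plausible reading of how a predicted relation interval, encoded as a Gaussian mask over frames, could re-weight attention over an entity tube's timeline. This is a minimal illustration, not the authors' implementation: the function names (`gaussian_temporal_mask`, `temporal_focus_attention`) and the single-head NumPy formulation are assumptions introduced here for clarity.

```python
import numpy as np

def gaussian_temporal_mask(center, width, num_frames):
    """Gaussian mask over frame indices, peaked at the predicted
    relation-interval center with spread `width` (both in frames).
    Hypothetical helper; the paper only states that intervals are
    represented as Gaussian masks."""
    t = np.arange(num_frames)
    return np.exp(-0.5 * ((t - center) / width) ** 2)

def temporal_focus_attention(query, keys, values, mask):
    """Single-head attention in which the Gaussian mask re-weights
    the attention distribution over time, so off-interval frames
    contribute little to the pooled context.
    query: (d,), keys/values: (T, d), mask: (T,)."""
    scores = keys @ query / np.sqrt(query.shape[0])    # (T,)
    weights = np.exp(scores - scores.max()) * mask     # mask suppresses off-interval frames
    weights = weights / weights.sum()
    return weights @ values                            # (d,) pooled relation context

# Toy example: a 10-frame entity-pair tube with a candidate
# relation interval centered near frame 6.
T, d = 10, 4
rng = np.random.default_rng(0)
keys = rng.normal(size=(T, d))
values = rng.normal(size=(T, d))
query = rng.normal(size=d)
mask = gaussian_temporal_mask(center=6.0, width=1.5, num_frames=T)
ctx = temporal_focus_attention(query, keys, values, mask)
print(ctx.shape)
```

The mask acts as a soft gate rather than a hard crop, so frames just outside the predicted interval still contribute a small amount, which keeps the operation differentiable with respect to the interval parameters.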
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6624