Abstract: Referring video segmentation identifies and segments the objects in a video that a text description refers to. This requires the model to perform spatiotemporal modeling of the video under the guidance of linguistic information. However, previous works neither explicitly consider the differences among objects in a video nor adequately align language information with object-level features, which leads to misclassifications in complex scenes. In this paper, we rethink the relationship between objects and language in videos and introduce a novel graph-based network (SLGN) for referring video segmentation to address these problems. Specifically, we design a Spatiotemporal Graph Perception (SGP) module that uses temporal, semantic, and positional priors to construct multidimensional edges between objects and employs graph convolution to model their spatiotemporal relationships. In addition, we design a Clue Graph Perception (CGP) module that uses the text description and candidate objects to construct an object-word graph, achieving modality alignment at the object level. Experimental results demonstrate that our method outperforms recent representative methods.
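The idea behind the SGP module (edges built from temporal, semantic, and positional priors, followed by graph convolution) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `build_edges` and `graph_conv`, the specific form of each prior, and the choice to average the three priors into a single adjacency matrix are all assumptions made for illustration. The abstract speaks of multidimensional edges, so the actual model may well keep the priors separate.

```python
import numpy as np

def build_edges(feats, frame_ids, centers, sigma=1.0):
    """Combine three illustrative priors into one adjacency matrix (assumed form)."""
    # Semantic prior: non-negative cosine similarity between object features.
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    semantic = np.clip(normed @ normed.T, 0.0, None)
    # Temporal prior: objects in nearby frames are connected more strongly.
    dt = np.abs(frame_ids[:, None] - frame_ids[None, :])
    temporal = 1.0 / (1.0 + dt)
    # Positional prior: Gaussian kernel on distances between box centers.
    d2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    positional = np.exp(-d2 / (2.0 * sigma ** 2))
    # Assumption: average the priors; each has a unit diagonal (self-loops).
    return (semantic + temporal + positional) / 3.0

def graph_conv(adj, feats, weight):
    """One graph-convolution layer: ReLU(D^-1/2 A D^-1/2 X W)."""
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    norm_adj = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm_adj @ feats @ weight, 0.0)

# Toy input: 4 candidate objects across 2 frames, 8-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
frame_ids = np.array([0, 0, 1, 1])
centers = rng.uniform(size=(4, 2))

adj = build_edges(feats, frame_ids, centers)
out = graph_conv(adj, feats, rng.normal(size=(8, 8)))
```

Here each row of `out` is an object feature updated with information from its neighbors, weighted by how strongly the priors connect them. An object-word graph in the spirit of the CGP module could reuse `graph_conv` with word embeddings added as extra nodes, although the abstract does not specify how those edges are built.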
External IDs:dblp:conf/icmcs/LianLWMZ25