Abstract: Referring video object segmentation (RVOS) aims to segment target objects in videos guided by a linguistic expression. Because the language expression is unrestricted, understanding objective semantics is key to accurate segmentation. Recent methods have employed different schemes for extracting textual features to better capture objective semantics. Nonetheless, a common issue in some recent approaches is semantic misalignment between linguistic and visual information during the simplistic cross-modal fusion process; this misalignment distorts objective semantics and leads to errors and confusion in segmentation. To address this, we propose an Effective Feature Representation via Semantic Weight Aggregation (EFR) framework. In our design, EFR begins with three encoders that embed the visual and linguistic inputs; among these, a textual adapter for CLIP is introduced to generate sentence-level features that precisely capture objective semantics. Then, relying on the reliable instance embeddings produced by the decoding module, which is guided by the enhanced sentence-level features, we propose a semantic weight aggregation (SWA) scheme that adjusts the cross-modal fused features so that objective semantics are preserved completely and accurately, further mitigating objective semantic distortion in the fused features. The enhanced fused features then undergo post-processing to yield high-quality segmentation. Experimental results on the Ref-YouTube-VOS, Ref-DAVIS17, and A2D datasets demonstrate the effectiveness and necessity of the proposed approach.
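The abstract does not specify how the SWA scheme reweights the fused features, so the following is only a minimal illustrative sketch of one plausible reading: per-location fused features are compared against the decoded instance embeddings, and locations whose semantics agree with the instances are emphasized. The function name `semantic_weight_aggregation`, the tensor shapes, and the similarity-then-reweight logic are all assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def semantic_weight_aggregation(fused_feats: torch.Tensor,
                                instance_embeds: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of a semantic-weight-aggregation step.

    fused_feats:     (B, C, H, W) cross-modal fused visual features
    instance_embeds: (B, Q, C)    instance embeddings from the decoding module
    Returns reweighted features with the same shape as `fused_feats`.
    """
    B, C, H, W = fused_feats.shape
    feats = fused_feats.flatten(2).transpose(1, 2)            # (B, H*W, C)

    # Cosine similarity between every spatial feature and every instance embedding.
    sim = F.cosine_similarity(
        feats.unsqueeze(2),            # (B, H*W, 1, C)
        instance_embeds.unsqueeze(1),  # (B, 1, Q, C)
        dim=-1,
    )                                                           # (B, H*W, Q)

    # Per-location weight: confidence of the best-matching instance query.
    weights = sim.softmax(dim=-1).max(dim=-1).values            # (B, H*W)
    weights = weights.view(B, 1, H, W)

    # Emphasize locations whose semantics agree with the decoded instances.
    return fused_feats * (1.0 + weights)


# Example usage with assumed dimensions (batch 2, 256 channels, 5 queries).
fused = torch.randn(2, 256, 32, 32)
instances = torch.randn(2, 5, 256)
out = semantic_weight_aggregation(fused, instances)  # (2, 256, 32, 32)
```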
External IDs: dblp:conf/icic/CaoWZRZ25