Abstract: Highlights
• A language-guided contrastive learning and data augmentation method for R-VOS.
• A sparse attention method to enhance multi-modal alignment.
• An improvement over R-VOS baselines with better identification of textual semantics.