4th PVUW MeViS 3rd Place Report: Sa2VA

Published: 2025, Last Modified: 02 Dec 2025CoRR 2025EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Referring video object segmentation (RVOS) is a challenging task that requires the model to segment the object in a video given the language description. MeViS is a recently proposed dataset that contains motion expressions of the target objects, leading to a challenging benchmark, compared with existing RVOS benchmarks. On the other hand, for referring expression tasks, a new trend is to adopt multi-modal large language model (MLLM) to achieve better image and text alignment. In this report, we show that with a simple modification to the test time inference method on stronger MLLMs, we can lead to stronger results on MeVIS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos. By enlarging the scope of key frames, without any further training, we can achieve the 3rd place in the 4th PVUW workshop.
Loading