InterRVOS: Interaction-aware Referring Video Object Segmentation

Published: 2025, Last Modified: 11 Nov 2025CoRR 2025EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Referring video object segmentation (RVOS) aims to segment objects in a video described by a natural language expression. However, most existing approaches focus on segmenting only the referred object (typically the actor), even when the expression clearly describes an interaction involving multiple objects with distinct roles. For instance, "A throwing B" implies a directional interaction, but standard RVOS segments only the actor (A), neglecting other involved target objects (B). In this paper, we introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on the modeling of interactions. It requires the model to segment the actor and target objects separately, reflecting their asymmetric roles in an interaction. This task formulation enables fine-grained understanding of object relationships, as many video events are defined by such relationships rather than individual objects. To support this task, we propose a new evaluation protocol that separately evaluates actor and target segmentation, enabling more accurate assessment of the model's ability to distinguish and segment actor and target roles. We also present InterRVOS-127K, a large-scale dataset with over 127K automatically annotated expressions, including interaction expressions annotated with distinct masks for actor and target objects. Furthermore, we develop ReVIOSa, an MLLM-based architecture that introduces interaction-aware special tokens and leverages an attention mask loss to enhance role-specific segmentation. Extensive experiments show that ReVIOSa not only outperforms existing baselines on our proposed InterRVOS-127K evaluation set, but also achieves strong performance on standard RVOS benchmarks. Our project page is available at: https://cvlab-kaist.github.io/InterRVOS.
Loading