Abstract: Prevailing Video Object Segmentation (VOS) methods follow an extraction-then-matching pipeline, which first extracts features from the current and reference frames independently and then performs dense matching between them. This decoupled pipeline limits information propagation between frames to high-level features and fails to capture the fine-grained details needed for matching. Furthermore, pixel-wise matching lacks a holistic understanding of the target, making it prone to disturbance by similar distractors. To address these issues, we propose a unified VOS framework, coined JointFormer, that jointly models feature extraction, correspondence matching, and a compressed memory. Its Joint Modeling Block leverages attention operations to simultaneously extract target information and propagate it from the reference frame to both the current frame and a compressed memory token. This joint modeling scheme enables extensive multi-layer propagation beyond the high-level feature space and facilitates robust, instance-distinctive feature learning. In addition, to incorporate long-term and holistic target information, we introduce a compressed memory token with a customized online updating mechanism, which aggregates target features and propagates temporal information frame by frame, enhancing global modeling consistency. JointFormer achieves new state-of-the-art performance on the DAVIS 2017 val/test-dev (89.7% and 87.6%) and YouTube-VOS 2018/2019 val (87.0% and 87.0%) benchmarks. To demonstrate its generalizability, JointFormer is further evaluated on four newer benchmarks with diverse challenges: MOSE for complex scenes, VISOR for egocentric videos, VOST for complex transformations, and LVOS for long-term videos.
Without any design tailored to these unusual difficulties, our model achieves the best performance across all four benchmarks compared with several state-of-the-art models, illustrating its strong generalization and robustness. Extensive ablations and visualizations further indicate that JointFormer enables more comprehensive and effective feature learning and matching.
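The core idea sketched in the abstract — a single attention pass over the concatenation of current-frame tokens, reference-frame tokens, and a compressed memory token, plus a frame-wise online update of that memory — can be illustrated with a minimal numpy sketch. All names (`joint_attention`, `update_memory`, the momentum value) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(cur, ref, mem, Wq, Wk, Wv):
    # Concatenate current-frame tokens, reference-frame tokens, and the
    # compressed memory token into one sequence, so a single attention
    # pass both extracts features and propagates target information
    # among all of them (the "joint modeling" idea).
    tokens = np.concatenate([cur, ref, mem], axis=0)        # (N, d)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    out = attn @ v
    n_cur, n_ref = cur.shape[0], ref.shape[0]
    # Split the updated sequence back into its three parts.
    return out[:n_cur], out[n_cur:n_cur + n_ref], out[n_cur + n_ref:]

def update_memory(mem, new_mem, momentum=0.9):
    # Illustrative frame-wise online update: blend the refreshed memory
    # token back into the running compressed memory (momentum assumed).
    return momentum * mem + (1 - momentum) * new_mem
```

In an actual multi-layer network this joint attention would run inside every block, so propagation is not restricted to high-level features; the sketch shows only a single layer.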
External IDs: dblp:journals/pami/ZhangCWW25