Abstract: Video Object Segmentation (VOS) is a crucial task in computer vision, involving the identification, tracking, and segmentation of objects in video sequences. Despite significant advancements, state-of-the-art models often struggle with specific challenges such as occlusions and highly similar objects in close proximity. This paper presents the Mask Enhancement Model (MEM), an approach that combines the strengths of the Segment Anything Model (SAM) and the Cutie model to address these limitations. While SAM excels at zero-shot image segmentation, it is not inherently designed for video sequences. Conversely, Cutie provides robust memory-based segmentation but struggles with occlusions and similar objects. MEM integrates these models to produce an enhanced output mask, improving performance over each model individually. The proposed solution is evaluated through several experiments, achieving state-of-the-art \( \mathcal{J}\&\mathcal{F} \) scores on standard benchmark datasets. This paper details the MEM architecture, its integration with SAM and Cutie, and the resulting improvements in video object segmentation performance. On the DAVIS-17 dataset, MEM improves \( \mathcal{J}\&\mathcal{F} \) by 0.7% over Cutie and by 3.2% over DeAOT-R50.