Domain Generalization for Multiple Video Object Segmentation and Tracking Using Transformers and Smart Memory

Elham Soltani Kazemi, Imad Eddine Toubal, Gani Rahmon, Juan David Mogollon, Kannappan Palaniappan

Published: 2026, Last Modified: 04 May 2026. Int. J. Comput. Vis. 2026. CC BY-SA 4.0
Abstract: Video Object Segmentation (VOS) is a key component in computer vision applications, including surveillance, autonomous driving, and robotics. However, existing VOS models often struggle to generalize to new videos with complex, topologically transforming deformable objects (e.g., cooking, assembling, state change), degraded environments, and long video sequences, resulting in tracking drift, low recall, and memory saturation. We developed MuSMem, a Multiple-object VOS and tracking Smart Memory architecture, a generalizable approach that incorporates three key innovations: (i) fusing SAM with High-Quality masks alongside appearance-based candidate selection to refine coarse segmentation masks, resulting in improved object boundaries; (ii) a dynamic smart memory that manages a history of key frames based on a novel information-preserving gain, combined with relevance and freshness spatio-temporal criteria; and (iii) an exploration of monocular depth maps for occlusion robustness. MuSMem significantly reduces memory usage and drift, tracks complex topological changes in objects, and improves long-term prediction performance. MuSMem can be integrated with Vision-Language Models (VLMs) for zero-shot generalization to unseen visual domains. Experiments using VOS benchmark datasets show that MuSMem ranks first on VOTSt-2024, the Long Video Dataset, and LVOS, and second on VOTS-2024, demonstrating the best generalizability and state-of-the-art performance across single-, multi-, and complex VOS tasks.