VideoMaMa: Mask-Guided Video Matting via Generative Prior

Sangbeom Lim, Seoung Wug Oh, Gabriel, Heeji Yoon, Seungryong Kim, Joon-Young Lee

Published: 05 Jun 2026, Last Modified: 09 May 2026CVPR 2026EveryoneCC BY-NC 4.0

Abstract: Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present Video Mask-to-Matte Model VideoMaMa that converts coarse segmentation masks into pixel accurate alpha mattes, by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video MA-V dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MAV to obtain SAM2-Matte, which outperforms the same model trained on existing matting datasets in terms of robustness on in-the-wild videos. These findings emphasize the importance of large-scale pseudo-labeled video matting and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research. All models and the MAV dataset will be publicly released.