Abstract: Traditional video matting networks depend on user-annotated trimaps to estimate alpha mattes for the foreground in videos. However, creating trimaps is labor-intensive and inflexible. Recent advances in video matting aim to eliminate the need for trimaps, but these methods struggle to estimate alpha mattes for specific individuals in scenes containing multiple instances. In this study, we propose the Flexible Video Matting (FVM) model, a novel video matting network that generates alpha mattes for any specified instance in a video from simple prompts such as text, bounding boxes, and points, without relying on user-annotated trimaps. FVM combines the Segment Anything Model (SAM) with a video object segmentation network to obtain semantic masks for the target instance. Additionally, we design a Mask-to-Trimap (MTT) module for FVM based on a recurrent architecture. This module exploits the semantic masks and the temporal information in the video to predict temporally consistent trimaps, which are then fed into the matting module to generate temporally consistent alpha mattes. Experimental results on the video matting benchmark demonstrate that our model achieves state-of-the-art matting quality and superior temporal coherence compared with methods that directly apply image matting techniques to video matting.
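The abstract describes a prompt-to-alpha pipeline (prompt → SAM mask on the first frame → video object segmentation to propagate the mask → recurrent Mask-to-Trimap module → trimap-based matting). As a rough illustration of how these stages compose, here is a minimal PyTorch sketch; the class and argument names (`FlexibleVideoMatting`, `MaskToTrimap`, `segmenter`, `propagator`, `matting_net`) and the simple recurrent cell are hypothetical stand-ins, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class MaskToTrimap(nn.Module):
    """Toy stand-in for the recurrent MTT module: carries a hidden state across
    frames and predicts 3-class trimap logits (background / unknown / foreground)
    from the current frame and its semantic mask."""

    def __init__(self, hidden_ch: int = 16):
        super().__init__()
        self.hidden_ch = hidden_ch
        self.update = nn.Conv2d(3 + 1 + hidden_ch, hidden_ch, 3, padding=1)
        self.head = nn.Conv2d(hidden_ch, 3, 1)  # bg / unknown / fg logits

    def forward(self, frame, mask, hidden=None):
        if hidden is None:
            b, _, h, w = frame.shape
            hidden = frame.new_zeros(b, self.hidden_ch, h, w)
        hidden = torch.tanh(self.update(torch.cat([frame, mask, hidden], dim=1)))
        return self.head(hidden), hidden


class FlexibleVideoMatting(nn.Module):
    """Minimal FVM-style pipeline sketch: prompt -> masks -> trimaps -> alphas."""

    def __init__(self, segmenter, propagator, matting_net):
        super().__init__()
        self.segmenter = segmenter      # e.g. SAM: prompt -> mask on the first frame
        self.propagator = propagator    # video object segmentation: propagates the mask
        self.mtt = MaskToTrimap()
        self.matting_net = matting_net  # trimap-based matting backbone

    def forward(self, frames, prompt):
        # frames: (T, 3, H, W); prompt: points / box / text, handled by the segmenter
        first_mask = self.segmenter(frames[0:1], prompt)   # (1, 1, H, W)
        masks = self.propagator(frames, first_mask)        # (T, 1, H, W)
        alphas, hidden = [], None
        for t in range(frames.shape[0]):
            trimap_logits, hidden = self.mtt(frames[t:t + 1], masks[t:t + 1], hidden)
            trimap = trimap_logits.softmax(dim=1)
            alphas.append(self.matting_net(frames[t:t + 1], trimap))
        return torch.cat(alphas, dim=0)                    # (T, 1, H, W)


if __name__ == "__main__":
    # Smoke test with dummy stand-ins for SAM, the VOS network, and the matting module.
    dummy_seg = lambda frame, prompt: torch.zeros(1, 1, 64, 64)
    dummy_vos = lambda frames, mask: torch.zeros(frames.shape[0], 1, 64, 64)
    dummy_matting = lambda frame, trimap: trimap[:, 2:3]   # use the fg-probability channel
    model = FlexibleVideoMatting(dummy_seg, dummy_vos, dummy_matting)
    alpha = model(torch.rand(4, 3, 64, 64), prompt=None)
    print(alpha.shape)  # torch.Size([4, 1, 64, 64])
```

The recurrence over frames is what lets the trimap (and hence the alpha matte) stay temporally consistent; the real MTT module presumably uses a richer recurrent design than this single convolutional update.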