Abstract: Large-scale pre-trained visual foundation models, such as the Segment Anything Model 2 (SAM2), demonstrate strong performance in video segmentation. However, they require multiple rounds of sophisticated manual prompting to achieve satisfactory results. This paper introduces IAP-SAM2, a method that integrates existing visual foundation models without additional training and enables iterative automatic prompting for open-world video segmentation. An innovative automatic prompting mechanism is designed to allow SAM2 to segment target objects in videos. Additionally, we propose a multi-round iterative prompt-generation strategy based on feature similarity, along with a voting mechanism, to refine object segmentation and address occlusion issues in video segmentation. Experimental results show that IAP-SAM2 outperforms existing open-world segmentation approaches on the DAVIS and LVOS datasets, particularly on complex videos with multiple targets and object occlusions, while maintaining robust segmentation performance. In the era of emerging foundation models, this work unlocks their potential for automated video segmentation and broadens the pathway for combining foundation models to address real-world challenges.
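To make the abstract's pipeline concrete, below is a minimal sketch of one plausible reading of the multi-round, feature-similarity-driven prompting with a majority vote. Everything here is an assumption for illustration: `extract_features`, `sam2_segment`, the shapes, and the prompt-selection rule are hypothetical stand-ins, not the paper's actual implementation or the SAM2 API.

```python
import numpy as np

rng = np.random.default_rng(0)
REF = rng.standard_normal(256)  # hypothetical reference embedding of the target

# --- Hypothetical stand-ins (IAP-SAM2 plugs in SAM2 and a pretrained
# --- feature extractor; the names and shapes here are illustrative only).
def extract_features(frame):
    """Per-pixel features; a 10x10 'object' region shares the reference embedding."""
    feats = rng.standard_normal((32, 32, 256))
    feats[10:20, 10:20] = REF
    return feats

def sam2_segment(frame, point):
    """Toy point-prompted segmenter: returns a 5x5 mask centred on the prompt."""
    mask = np.zeros((32, 32), dtype=bool)
    r, c = point
    mask[max(r - 2, 0):r + 3, max(c - 2, 0):c + 3] = True
    return mask

def iterative_prompt(frame, ref, rounds=3):
    """Multi-round automatic prompting: each round picks the most
    feature-similar location as a point prompt, segments, and finally
    keeps the pixels selected by a majority of rounds."""
    feats = extract_features(frame)
    sim = feats @ ref / (np.linalg.norm(feats, axis=-1) * np.linalg.norm(ref))
    votes = np.zeros(sim.shape, dtype=int)
    for _ in range(rounds):
        point = np.unravel_index(np.argmax(sim), sim.shape)  # best prompt location
        votes += sam2_segment(frame, point)
        sim[point] = -np.inf  # force a fresh prompt in the next round
    return votes > rounds // 2  # majority vote over the per-round masks

mask = iterative_prompt(frame=None, ref=REF)
print(mask.sum(), "pixels survive the vote")
```

The voting step is one way such a mechanism could suppress spurious masks from any single prompting round, which matches the abstract's stated motivation of refining segmentation under occlusion; the actual selection and voting rules in IAP-SAM2 may differ.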
External IDs: dblp:conf/ijcnn/LiLJCNWL25