Abstract: Video Copy Localization (VCL) aims to identify all copied segments within untrimmed video pairs. Fully supervised methods, which require annotating the boundaries of copied segments, are labor-intensive and susceptible to distraction from non-copied segments. To address this problem, we propose a more efficient annotation paradigm called "single frame supervision", which requires only two randomly selected timestamps within the copied segments. Building on single frame supervision, we introduce a method called VCSA. It combines spatial-temporal perception and fusion modules to extract and fuse features, which are then segmented for contrastive learning. Additionally, we propose a grid alignment loss to enhance the model's ability to distinguish copied from non-copied segments. Extensive experiments demonstrate that VCSA outperforms other methods within our paradigm, sometimes even reaching performance levels comparable to fully supervised methods.
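The annotation paradigm described above can be illustrated with a minimal sketch. The snippet below simulates single frame supervision by reducing full boundary annotations to one random timestamp per video in each copied segment pair; the function name, segment tuple layout, and use of `random.Random` are illustrative assumptions, not part of the paper.

```python
import random

def sample_single_frame_labels(copied_segments, rng=None):
    """Simulate single frame supervision (hypothetical helper).

    Each copied segment pair is given as (start_a, end_a, start_b, end_b),
    i.e. the copied span's boundaries in video A and video B. Instead of
    keeping the full boundaries, we keep one randomly selected timestamp
    inside the copied span of each video.
    """
    rng = rng or random.Random(0)
    labels = []
    for (sa, ea, sb, eb) in copied_segments:
        ta = rng.uniform(sa, ea)  # random timestamp within the copied span of video A
        tb = rng.uniform(sb, eb)  # random timestamp within the copied span of video B
        labels.append((ta, tb))
    return labels

# Example: two annotated copied segment pairs (times in seconds)
segs = [(10.0, 25.0, 3.0, 18.0), (40.0, 55.0, 30.0, 45.0)]
frames = sample_single_frame_labels(segs)
```

This reflects why the paradigm is cheaper to annotate: a labeler only has to point at a single moment inside each copied span rather than locate its exact boundaries.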
External IDs: dblp:conf/icassp/LiHYZ25