Disentangling Spatio-Temporal Knowledge for Weakly Supervised Object Detection and Segmentation in Surgical Video

Published: 01 Jan 2025, Last Modified: 06 Oct 2025, WACV 2025, CC BY-SA 4.0
Abstract: Weakly supervised video object segmentation (WSVOS) enables the identification of segmentation maps without requiring extensive annotations of object masks, relying instead on coarse video labels indicating object presence. WSVOS in surgical videos is, however, more challenging due to the complex interaction of multiple transient objects, such as surgical tools moving in and out of the surgical field. In this scenario, state-of-the-art WSVOS methods struggle to learn accurate segmentation maps. We address this problem by introducing ViDeo Spatio-Temporal disentanglement Networks (VDST-Net), a framework to disentangle complex spatio-temporal object interactions using semi-decoupled knowledge distillation to predict high-quality class activation maps (CAMs). A teacher network is designed to help a temporal-reasoning student network resolve activation conflicts, as the student leverages temporal dependencies when specifics about object location and timing in the video are not provided. We demonstrate the efficacy of our framework on a challenging surgical video dataset where objects are, on average, present in less than 60% of annotated frames, and compare our method to state-of-the-art methods on surgical data and on a public dataset commonly used to benchmark WSVOS. Our method outperforms state-of-the-art techniques and generates accurate segmentation masks under video-level weak supervision. Our code is available at: https://github.com/PCASOlab/VDST-net.
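For readers unfamiliar with class activation maps, the following is a minimal, generic sketch of how a CAM can be computed and thresholded into a coarse mask. It is not the VDST-Net implementation: it assumes a classifier made of global average pooling followed by a linear layer, and the function names, the 0.5 threshold, and the toy feature maps are all hypothetical choices made for illustration.

```python
def class_activation_map(features, class_weights):
    """Weighted sum of feature channels for one class.

    features: list of C channel maps, each an H x W nested list (assumed
              to come from the last convolutional layer).
    class_weights: list of C floats, the linear-classifier weights for
                   the class of interest (assumed GAP + linear head).
    """
    height, width = len(features[0]), len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for channel, weight in zip(features, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * channel[y][x]
    return cam


def cam_to_mask(cam, threshold=0.5):
    """Min-max normalise a CAM to [0, 1] and threshold it into a binary mask."""
    values = [v for row in cam for v in row]
    lo, hi = min(values), max(values)
    scale = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[(v - lo) / scale >= threshold for v in row] for row in cam]


# Toy example: two feature channels on a 2 x 2 grid, two hypothetical tool classes.
features = [
    [[1.0, 0.0], [0.0, 0.0]],  # channel 0 fires in the top-left corner
    [[0.0, 0.0], [0.0, 2.0]],  # channel 1 fires in the bottom-right corner
]
mask_tool_a = cam_to_mask(class_activation_map(features, [1.0, 0.0]))
mask_tool_b = cam_to_mask(class_activation_map(features, [0.0, 1.0]))
```

In this toy setting each class depends on a single channel, so the two masks do not overlap. The activation conflicts the abstract refers to arise when several classes respond to the same regions, which this sketch does not attempt to resolve.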