Abstract: Automatic recognition of surgical workflow plays a vital role in modern operating rooms. Given the complex nature and extended duration of surgical videos, accurate workflow recognition is highly challenging. Although this task has been widely studied, existing methods still face two major limitations: insufficient visual feature extraction and performance degradation caused by inconsistency between training and testing features. To address these limitations, this paper proposes a Multi-Teacher Temporal Regulation Network (MTTR-Net) for surgical workflow recognition. To extract discriminative visual features, we introduce a “sequence of clips” training strategy. This strategy employs a set of sparsely sampled video clips as input to train the feature encoder and incorporates an auxiliary temporal regularizer to model long-range temporal dependencies across these clips, ensuring that the feature encoder captures critical information from each frame. Then, to mitigate the inconsistency between training and testing features, we further develop a cross-mimicking strategy that iteratively trains multiple feature encoders on different data subsets to generate consistent mimicked features. A temporal encoder is trained on these mimicked features to achieve stable performance during testing. Extensive experiments on eight public surgical video datasets demonstrate that our MTTR-Net outperforms state-of-the-art methods across various metrics. Our code has been released at https://github.com/kaideH/MGTR-Net.
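
The sketch below is a minimal, hypothetical illustration of the "sequence of clips" idea described in the abstract: sparsely sampling several short clips that span a long video, encoding all frames with a shared encoder, and applying an auxiliary temporal module across the concatenated clip features. All names (`sample_clip_indices`, `ClipSequenceModel`, the tiny CNN backbone, the GRU regularizer, and the tensor sizes) are illustrative assumptions, not the actual MTTR-Net implementation.

```python
# Hypothetical sketch of sparse "sequence of clips" sampling with a shared
# frame encoder and an auxiliary temporal regularizer (not the paper's code).
import torch
import torch.nn as nn

def sample_clip_indices(num_frames: int, num_clips: int, clip_len: int) -> torch.Tensor:
    """Sparsely sample `num_clips` short clips that together span the whole video."""
    # Evenly spaced clip start positions, one clip per video segment.
    starts = torch.linspace(0, max(num_frames - clip_len, 0), num_clips).long()
    offsets = torch.arange(clip_len)
    return starts.unsqueeze(1) + offsets.unsqueeze(0)  # (num_clips, clip_len)

class ClipSequenceModel(nn.Module):
    """Frame encoder shared across clips plus a lightweight temporal regularizer."""
    def __init__(self, feat_dim: int = 128, num_phases: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(                  # stand-in for a 2D CNN backbone
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, feat_dim),
        )
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)  # auxiliary regularizer
        self.head = nn.Linear(feat_dim, num_phases)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, num_clips, clip_len, 3, H, W)
        b, n, t, c, h, w = clips.shape
        feats = self.encoder(clips.reshape(b * n * t, c, h, w)).reshape(b, n * t, -1)
        feats, _ = self.temporal(feats)                # long-range dependencies across clips
        return self.head(feats)                        # per-frame phase logits

# Toy usage: a 100-frame video, 4 sparsely sampled clips of 8 frames each.
idx = sample_clip_indices(num_frames=100, num_clips=4, clip_len=8)
video = torch.randn(100, 3, 64, 64)
clips = video[idx].unsqueeze(0)                        # (1, 4, 8, 3, 64, 64)
logits = ClipSequenceModel()(clips)
print(logits.shape)                                    # torch.Size([1, 32, 7])
```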