Abstract: Surgical triplet recognition aims to identify instruments, verbs, and targets in a single video frame while establishing associations among these components. Since this task exhibits a severely imbalanced class distribution, precisely identifying tail classes becomes a critical challenge. To cope with this issue, existing methods leverage knowledge distillation to facilitate tail triplet recognition. However, these methods overlook the low inter-triplet feature variance, which diminishes the model’s confidence in identifying classes. As a technique for learning discriminative features across instances, contrastive learning (CL) shows great potential for identifying triplets. Under this imbalanced class distribution, however, directly applying CL presents two problems: 1) multiple activities in one image expose instance feature learning to interference from other classes, and 2) the limited training samples of tail classes may lead to inadequate semantic capturing. In this paper, we propose a tail-enhanced representation learning (TERL) method to address these problems. TERL employs a disentangle module to acquire instance-level features from a single image. Given these disentangled instances, those from tail classes are selected for CL, which captures discriminative features via a global memory bank. During CL, we further apply semantic enhancement to each tail class. This generates component class prototypes based on the global bank, thus providing additional component information to tail classes. We evaluate the performance of TERL on the 5-fold cross-validation split of the CholecT45 dataset. The experimental results consistently demonstrate the superiority of TERL over state-of-the-art methods. Our code is available at https://github.com/CIAM-Group/ComputerVision_Codes/tree/main/TERL.
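The memory-bank-based contrastive learning and prototype generation mentioned above can be illustrated with a minimal sketch. This is a conceptual illustration only, not the authors' TERL implementation: the class `TailMemoryBank`, the per-class queue size, and the InfoNCE-style loss below are all illustrative assumptions.

```python
import numpy as np

class TailMemoryBank:
    """Hypothetical global memory bank storing L2-normalised
    instance features per tail class (illustrative, not TERL itself)."""

    def __init__(self, dim, size_per_class=64):
        self.dim = dim
        self.size = size_per_class
        self.bank = {}  # class id -> list of normalised feature vectors

    def push(self, label, feat):
        # Normalise the feature and enqueue it; evict oldest when full (FIFO).
        feat = feat / (np.linalg.norm(feat) + 1e-12)
        queue = self.bank.setdefault(label, [])
        queue.append(feat)
        if len(queue) > self.size:
            queue.pop(0)

    def prototype(self, label):
        # A simple class prototype: the normalised mean of stored features,
        # usable as extra semantic information for a tail class.
        feats = np.stack(self.bank[label])
        proto = feats.mean(axis=0)
        return proto / (np.linalg.norm(proto) + 1e-12)

def contrastive_loss(query, pos, negs, temperature=0.1):
    """InfoNCE-style loss: pull the query toward the positive feature
    and push it away from negatives drawn from the memory bank."""
    q = query / (np.linalg.norm(query) + 1e-12)
    logits = np.array([q @ pos] + [q @ n for n in negs]) / temperature
    logits -= logits.max()  # numerical stability before softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])
```

A query feature close to its class prototype and far from negatives yields a low loss, which is the discriminative behaviour CL is meant to encourage.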