Abstract: Quantifying the real-time informal feedback delivered by an experienced surgeon to a trainee during surgery is important for skill improvement in surgical training. Such feedback in the live operating room is inherently multimodal, consisting of verbal conversations (e.g., questions and answers) as well as non-verbal elements (e.g., visual cues such as pointing to anatomic structures). In this work, we leverage
a clinically-validated five-category classification
of surgical feedback: “Anatomic”, “Technical”,
“Procedural”, “Praise” and “Visual Aid”. We
then develop a multi-label machine learning
model to classify these five categories of surgical feedback from inputs of text, audio, and
video modalities. The ultimate goal of our
work is to help automate the annotation of real-time contextual surgical feedback at scale. Our
automated classification of surgical feedback
achieves AUCs ranging from 71.5 to 77.6, with fusion of the modalities improving performance by 3.1%. We
also show that high-quality manual transcriptions of feedback audio from experts improve
AUCs to between 76.5 and 96.2, which demonstrates a clear path toward future improvements. Empirically, we find that a staged training strategy, which first pre-trains each modality separately and then trains them jointly, is more effective than training all modalities together from the start. We also present intuitive
findings on the importance of each modality for different feedback categories. This work offers an
important first look at the feasibility of automated classification of real-world live surgical
feedback based on text, audio, and video modalities.
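
To make the described pipeline concrete, below is a minimal PyTorch sketch of one plausible reading of the multi-label, three-modality classifier with late fusion and the staged training schedule (pre-train each modality branch separately, then train all branches jointly). Everything here (encoder choices, feature dimensions, learning rates, and function names) is an illustrative assumption rather than the authors' actual implementation.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 5  # Anatomic, Technical, Procedural, Praise, Visual Aid

class ModalityBranch(nn.Module):
    """Encoder plus a multi-label head for one modality (text, audio, or video)."""
    def __init__(self, in_dim: int, hid_dim: int = 256):
        super().__init__()
        # Stand-in for a real pretrained encoder; dimensions are assumptions.
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.head = nn.Linear(hid_dim, NUM_CLASSES)

    def forward(self, x):
        z = self.encoder(x)
        return z, self.head(z)  # features for fusion, per-modality logits

class FusionModel(nn.Module):
    """Late fusion: concatenate per-modality features, then classify."""
    def __init__(self, branches: dict, hid_dim: int = 256):
        super().__init__()
        self.branches = nn.ModuleDict(branches)
        self.fusion_head = nn.Linear(hid_dim * len(branches), NUM_CLASSES)

    def forward(self, inputs: dict):
        feats = [self.branches[m](inputs[m])[0] for m in self.branches]
        return self.fusion_head(torch.cat(feats, dim=-1))

# Multi-label setup: an independent sigmoid per feedback category.
criterion = nn.BCEWithLogitsLoss()

branches = {
    "text": ModalityBranch(in_dim=768),   # assumed text embedding size
    "audio": ModalityBranch(in_dim=128),  # assumed audio feature size
    "video": ModalityBranch(in_dim=512),  # assumed video feature size
}

def pretrain_branch(branch, loader, epochs=1):
    """Stage 1 (assumed): pre-train one modality branch on its own inputs."""
    opt = torch.optim.Adam(branch.parameters(), lr=1e-4)
    for _ in range(epochs):
        for x, y in loader:  # y: float tensor of shape (batch, NUM_CLASSES)
            _, logits = branch(x)
            loss = criterion(logits, y)
            opt.zero_grad()
            loss.backward()
            opt.step()

def train_joint(model, loader, epochs=1):
    """Stage 2 (assumed): fine-tune all branches and the fusion head together."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for _ in range(epochs):
        for inputs, y in loader:  # inputs: dict of modality tensors
            loss = criterion(model(inputs), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

model = FusionModel(branches)
```

Under this reading, the staged schedule simply calls pretrain_branch once per modality before train_joint, which matches the abstract's claim that separate pre-training followed by joint training outperforms training all modalities together from the start.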