Generating Natural-Language Surgical Feedback: From Structured Representation to Domain-Grounded Evaluation

Firdavs Nasriddinov, Rafal Dariusz Kocielnik, Anima Anandkumar, Andrew Hung

Published: 27 Nov 2025, Last Modified: 09 Dec 2025
Venue: ML4H 2025 Poster
License: CC BY 4.0
Keywords: surgical training, feedback generation, action triplets, video-understanding, ontology extraction
TL;DR: We generate clinically grounded, trainer-style feedback from surgical video by (1) predicting action triplets from video on an ontology mined from trainer transcripts, and (2) conditioning an LLM on these triplets to produce auditable guidance.
Track: Proceedings
Abstract: High-quality intraoperative feedback from a surgical trainer is pivotal for improving trainee performance and long-term skill acquisition. Automating natural, trainer-style feedback promises timely, accessible, and consistent guidance at scale, but it requires models that understand clinically relevant representations. We present a structure-aware pipeline that learns a surgical action ontology from real trainer→trainee transcripts (33 surgeries) and uses it to condition feedback generation. Our contributions are (1) mining Instrument-Action-Target (IAT) triplets from real-world feedback text and clustering surface forms into normalized categories, (2) fine-tuning a video→IAT model that leverages the surgical procedure and task contexts as well as fine-grained temporal instrument motion (crucial for representing instruments and actions over time), and (3) demonstrating how to use the IAT triplet representation to guide GPT-4o in generating clinically grounded, trainer-style feedback. On Task 1 (video→IAT recognition), our context injection and temporal tracking deliver consistent AUC gains: Instrument 0.67→0.74, Action 0.60→0.63, Tissue 0.74→0.79. On Task 2 (feedback text generation), outputs are scored against the human trainer's feedback on a fidelity rubric (1 [opposite/unsafe], 3 [admissible], 5 [perfect match]). GPT-4o from video alone scores 2.17; IAT conditioning reaches 2.44 (+12.4%) and doubles the share of admissible generations (score ≥3) from 21% to 42%. Traditional metrics also improve: Word Error Rate (WER) falls by 15–31% and ROUGE (phrase/substring overlap) rises by 9–64%. Grounding generation in explicit IAT structure improves fidelity and yields clinician-verifiable rationales, supporting auditable use in surgical training.
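The evaluation quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names, the whitespace tokenization, and the example sentences and scores are assumptions made here for illustration. It shows word-level WER via edit distance, the relative change behind the reported +12.4%, and the share of generations at or above the admissible rubric score of 3.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: lowercase whitespace tokenization; the paper's exact
    text normalization is not stated in the abstract.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_change(baseline: float, new: float) -> float:
    """Relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline


def admissible_fraction(scores: Sequence[int], threshold: int = 3) -> float:
    """Share of rubric scores (1 to 5) at or above the admissible threshold."""
    return sum(score >= threshold for score in scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical trainer sentence and model output, not from the dataset.
    print(word_error_rate("pull the needle through", "push the needle"))
    # Mean rubric scores reported in the abstract: 2.17 (video only), 2.44 (IAT).
    print(f"{relative_change(2.17, 2.44):+.1%}")
    # Hypothetical rubric scores for five generations.
    print(admissible_fraction([1, 2, 3, 4, 5]))
```

Note that a lower WER is better while a higher rubric score is better, so the two relative changes in the abstract point in opposite directions for the same improvement.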
General Area: Applications and Practice
Specific Subject Areas: Explainability & Interpretability, Natural Language Processing, Representation Learning
Data And Code Availability: Yes
Ethics Board Approval: Yes
Entered Conflicts: I confirm the above
Anonymity: I confirm the above
Submission Number: 145