Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI

Cristian Cosentino, Hugo Georgenthum, Fabrizio Marozzo, Pietro Lio

Published: 27 Nov 2025, Last Modified: 09 Dec 2025
Venue: ML4H 2025 Poster
License: CC BY 4.0
Keywords: Surgical video analysis, Multimodal learning, Vision Transformers, Clip-level captioning, Large Language Models (LLMs), Automated surgical reporting, Explainable AI (XAI), Clinical decision support
TL;DR: We present a multimodal framework that combines vision transformers and large language models to automatically generate accurate and explainable surgical video reports.
Track: Proceedings
Abstract: Automatic summarization of surgical videos is critical for improving procedural documentation, supporting surgical training, and facilitating post-operative analysis. Despite recent advances in computer vision and natural language processing, most existing methods focus on either tool detection or clip-level captioning and lack an integrated approach that produces full, clinically meaningful reports. We introduce a multimodal framework that leverages vision transformers and large language models to generate comprehensive surgical video summaries. The method unfolds in three stages: (i) extraction of frame-level features to capture tools, tissues, and surgical actions; (ii) integration of temporal context through a ViViT-based encoder combined with frame-level captions; and (iii) synthesis of clip-level descriptions into structured surgical reports using a dedicated LLM. We evaluate the framework on the CholecT50 dataset of 50 laparoscopic videos, achieving 96% precision in tool detection and a BERTScore of 0.74 for temporal summarization. These results demonstrate the potential of combining computer vision and language models to advance AI-assisted reporting, offering a step toward reliable, interpretable, and efficient clinical documentation.
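The three-stage flow described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the names `FrameAnnotation`, `chunk_into_clips`, and `summarize_video`, the non-overlapping clip split, and the default clip length of 16 are all assumptions made for illustration. The three callables stand in for the frame-level model, the ViViT-based clip encoder, and the report-writing LLM.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class FrameAnnotation:
    """Hypothetical container for stage (i) output of a single frame."""
    index: int
    tools: List[str]
    caption: str


def chunk_into_clips(items: Sequence[Any], clip_len: int) -> List[Sequence[Any]]:
    """Split a sequence into consecutive, non-overlapping clips (assumed scheme)."""
    return [items[i:i + clip_len] for i in range(0, len(items), clip_len)]


def summarize_video(
    frames: Sequence[Any],
    frame_model: Callable[[int, Any], FrameAnnotation],
    clip_model: Callable[[Sequence[FrameAnnotation]], str],
    report_model: Callable[[List[str]], str],
    clip_len: int = 16,  # assumed value, not taken from the paper
) -> str:
    # Stage (i): frame-level extraction of tools and captions.
    annotations = [frame_model(i, frame) for i, frame in enumerate(frames)]
    # Stage (ii): temporal integration, one description per clip.
    clip_captions = [clip_model(clip) for clip in chunk_into_clips(annotations, clip_len)]
    # Stage (iii): synthesis of clip descriptions into one report.
    return report_model(clip_captions)


# Placeholder models so the sketch runs without any trained network.
def stub_frame_model(index: int, frame: Any) -> FrameAnnotation:
    return FrameAnnotation(index, ["grasper"], f"frame {index}: grasper visible")


def stub_clip_model(clip: Sequence[FrameAnnotation]) -> str:
    tools = sorted({tool for ann in clip for tool in ann.tools})
    return f"frames {clip[0].index}-{clip[-1].index}: {', '.join(tools)}"


def stub_report_model(clip_captions: List[str]) -> str:
    return "\n".join(f"{n}. {c}" for n, c in enumerate(clip_captions, start=1))


report = summarize_video(
    list(range(5)), stub_frame_model, stub_clip_model, stub_report_model, clip_len=2
)
print(report)
```

Passing the stages in as callables keeps the sketch independent of any particular vision or language model. A real system would replace the stubs with trained components.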
General Area: Applications and Practice
Specific Subject Areas: Natural Language Processing, Medical Imaging, Explainability & Interpretability, Public & Social Health
Data And Code Availability: Yes
Ethics Board Approval: No
Entered Conflicts: I confirm the above
Anonymity: I confirm the above
Submission Number: 30