EndoAssistant: A Large-scale Vision-Language Dataset for Endoscopic Surgery Understanding from Open-Source Videos

Submitted to ICLR 2025 on 26 Sept 2024 (modified: 05 Feb 2025). License: CC BY 4.0
Keywords: Medical image, endoscopy, vision-language model
TL;DR: We present a large-scale, meticulously curated dataset from surgical endoscopic videos, designed to use image-text pairs to facilitate medical scene understanding.
Abstract: Endoscopic interventions offer a minimally invasive approach, reducing patient discomfort and speeding recovery. Training junior surgeons to proficiency requires the ability to analyze and interpret endoscopic scenes through questioning and answering. Consequently, a robust foundation model for endoscopic vision-language understanding holds immense value for medical training and surgical education. However, existing endoscopy vision-language datasets are limited in scale and diversity, comprising only 50 videos sourced from a few clinical sites, which poses a significant hurdle to developing generalized and robust artificial intelligence models for endoscopic surgical applications. To address this challenge, we present a large-scale, meticulously curated image-text dataset of surgical endoscopic scenes from expert surgeons, designed to propel a vision-language assistant for medical scene understanding. Encompassing 590 open-source videos spanning more than 91 hours, our curated dataset includes 65,844 unique images, 30,002 unique captions, and 157,589 image-caption/question-answering pairs. This dataset aims to support the development of automated systems that assist medical professionals by mitigating repetitive tasks. Our contributions form a comprehensive endoscopic surgery assistance pipeline: (1) the first image-caption dataset specifically for endoscopic scenes; (2) an image-question-answer dataset that offers greater size and diversity than existing collections; (3) rigorous evaluation demonstrating its efficacy on downstream surgical endoscopic scene comprehension tasks such as classification, retrieval, and visual question answering.
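To make the dataset's two pair types concrete, the sketch below shows one plausible way image-caption and image-question-answer records could be represented. This is a hypothetical illustration only: the field names, values, and helper function are assumptions, not the dataset's released schema.

```python
# Hypothetical record layout for the two pair types described in the abstract:
# image-caption pairs and image-question-answer pairs. All field names and
# values here are illustrative assumptions, not the actual dataset schema.

records = [
    {"image_id": "vid001_frame_01200", "type": "caption",
     "text": "A grasper retracts tissue to expose the operative field."},
    {"image_id": "vid001_frame_01200", "type": "qa",
     "question": "Which instrument is visible?", "answer": "A grasper."},
]

def pairs_of_type(records, kind):
    """Return all records of a given pair type ('caption' or 'qa')."""
    return [r for r in records if r["type"] == kind]

captions = pairs_of_type(records, "caption")
qa_pairs = pairs_of_type(records, "qa")
```

Keeping both pair types keyed to the same `image_id` lets a single frame serve captioning, retrieval, and visual question answering, matching the three downstream tasks evaluated in the paper.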
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7850