Keywords: Esophagitis, Deep Learning, DINO, Upper GI Endoscopy, Self-Supervised
Abstract: Early and accurate detection of esophagitis in upper gastrointestinal endoscopy is essential for guiding targeted treatment and preventing progression to severe diseases such as esophageal cancer. Although deep learning methods have shown promise in supporting esophagitis diagnosis, their performance heavily relies on large amounts of labeled data, which are scarce. Consequently, supervised models often struggle to generalize to the high visual variability and subtle lesion differences encountered in real-world endoscopic examinations. In this work, we study discriminative self-supervised pre-training as a means of leveraging large-scale unlabeled data for robust representation learning. Multiple Vision Transformer models are pre-trained using the DINO framework on 395,201 unlabeled gastrointestinal endoscopy images and subsequently fine-tuned on a curated esophagitis dataset from three clinical centers. Our results demonstrate that self-supervised pre-training on in-domain endoscopic images significantly improves esophagitis detection performance compared to supervised pre-training on natural image datasets such as ImageNet. Specifically, in-domain DINO pre-training yields an average performance gain of 6.60\% in AUCPRC and 2.44\% in AUCROC on the downstream detection task. These findings highlight the importance of in-domain self-supervised learning for reducing annotation dependency and improving model robustness in upper GI endoscopy analysis.
Primary Subject Area: Application: Endoscopy
Secondary Subject Area: Detection and Diagnosis
Registration Requirement: Yes
Reproducibility: https://github.com/tofriede/SSL-Esophagitis-Detection
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 92
Loading