A Vision Foundation Model for Cataract Surgery Using Joint-Embedding Predictive Architecture

Published: 27 Mar 2025, Last Modified: 01 May 2025
Venue: MIDL 2025 Poster
License: CC BY 4.0
Keywords: Surgical Pretraining, Joint Embedding Predictive Network, Cataract Surgery
Abstract: Vision foundation models can automate analysis of surgical videos and enable multiple applications that support patient care and surgical training. For cataract surgery, existing models are limited by reliance on small datasets, privacy concerns, and poor generalizability across surgical settings. In this paper, we introduce JHU-VPT(JEPA), a self-supervised vision foundation model leveraging Joint-Embedding Predictive Architecture (JEPA) to learn spatiotemporal representations via latent feature prediction on a large corpus of unlabeled cataract videos, without requiring extensive labeled datasets or pixel-level reconstruction. JHU-VPT(JEPA) is pretrained on 2591 videos from multiple sites that capture different surgical technique and style variations. Comprehensive evaluations on step recognition, surgical feedback, and skill assessment tasks demonstrate that JHU-VPT(JEPA) outperforms existing methods. JHU-VPT(JEPA)’s effectiveness is evident even when using attentive probing with a frozen encoder, highlighting the robustness of the learned features and addressing privacy concerns by not requiring access to raw videos during downstream tasks. Our approach offers a scalable, generalizable, and privacy-preserving solution for surgical video analysis, with significant potential to advance patient care and surgical education.
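For readers unfamiliar with attentive probing, the following is a minimal sketch of the general idea, not the authors' implementation. It assumes a generic design: a single learned query cross-attends over token features exported once by a frozen encoder, and a linear layer classifies the pooled vector. The class name `AttentiveProbe`, the feature shape (4 clips, 16 tokens, 32 dimensions), and the 10 output classes are hypothetical, the random array stands in for real encoder output, and training of the probe weights is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class AttentiveProbe:
    """Single-query cross-attention pooling followed by a linear classifier.

    Hypothetical illustration: only these weights would be trained,
    while the encoder that produced the input tokens stays frozen.
    """

    def __init__(self, dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        scale = dim ** -0.5
        self.query = rng.normal(0.0, scale, size=(1, dim))    # learned query
        self.w_k = rng.normal(0.0, scale, size=(dim, dim))    # key projection
        self.w_v = rng.normal(0.0, scale, size=(dim, dim))    # value projection
        self.w_out = rng.normal(0.0, scale, size=(dim, num_classes))
        self.b_out = np.zeros(num_classes)

    def forward(self, tokens):
        # tokens: (batch, num_tokens, dim) features from the frozen encoder
        keys = tokens @ self.w_k
        values = tokens @ self.w_v
        # Scaled dot product of the query with every key: (batch, num_tokens)
        scores = (keys @ self.query.T).squeeze(-1) / np.sqrt(tokens.shape[-1])
        weights = softmax(scores, axis=-1)
        # Attention-weighted sum of values: (batch, dim)
        pooled = np.einsum("bn,bnd->bd", weights, values)
        logits = pooled @ self.w_out + self.b_out
        return logits, weights


# Random stand-in for precomputed features (assumption, not real data).
features = np.random.default_rng(1).normal(size=(4, 16, 32))
probe = AttentiveProbe(dim=32, num_classes=10)
logits, weights = probe.forward(features)
```

Because the probe consumes only precomputed features, a downstream site in this sketch would need the exported feature arrays rather than the raw videos, which is the property the abstract ties to privacy.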
Primary Subject Area: Foundation Models
Secondary Subject Area: Unsupervised Learning and Representation Learning
Paper Type: Validation or Application
Registration Requirement: Yes
Visa & Travel: Yes
Submission Number: 129