Learning Action-Conditioned World Models for Cataract Surgery from Unlabeled Videos

26 Nov 2025 (modified: 15 Dec 2025) | MIDL 2026 Conference Submission | Readers: Everyone | CC BY 4.0
Keywords: Surgical Video Analysis, World Models, Latent Actions, SSL
TL;DR: SurgWorld learns an action-conditioned world model from unlabeled cataract videos using latent action tokens, improving step recognition and anticipation over state-only pretraining
Abstract: Vision foundation models have enabled automated analysis of cataract surgery videos, but existing self-supervised approaches treat video as state-only sequences, limiting causal reasoning and sample efficiency in label-scarce settings. We present $\textit{SurgWorld}$, an action-conditioned world model that learns surgical dynamics from unlabeled cataract videos by combining a Latent Action Tokenizer, which discretizes frame-to-frame motion into atomic action primitives, with a latent predictor trained on top of a frozen cataract foundation encoder. By modeling state transitions in feature space conditioned on inferred actions rather than generating pixels, $\textit{SurgWorld}$ separates tool motion from static anatomy and learns a latent control signal that is complementary to visual appearance. Pretrained on a multi-institutional corpus and evaluated on four cataract datasets, $\textit{SurgWorld}$ improves step recognition accuracy over state-only baselines, with gains of about 10 percentage points in low-data regimes, indicating that explicit dynamics provide a sample-efficient prior. Ablation studies show that action-only features are already discriminative, and that fusing actions with vision encoder features achieves state-of-the-art performance and consistent improvements in step anticipation. These results support the view that latent actions capture orthogonal temporal structure that describes how cataract procedures progress.
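To make the two components named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say how motion is discretized or how the predictor is built, so both are assumptions here: the tokenizer is a nearest-neighbour codebook lookup on frame-to-frame feature differences, and the predictor simply adds the selected action embedding to the current latent state. All names (`quantize_actions`, `predict_next`, `codebook`) are hypothetical, and random vectors stand in for frozen encoder features.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize_actions(z_t, z_next, codebook):
    """Map each frame-to-frame feature change to the index of its nearest
    codebook entry (assumed stand-in for a latent action tokenizer)."""
    actions = []
    for cur, nxt in zip(z_t, z_next):
        delta = [n - c for c, n in zip(cur, nxt)]
        nearest = min(range(len(codebook)),
                      key=lambda k: sq_dist(delta, codebook[k]))
        actions.append(nearest)
    return actions

def predict_next(z_t, actions, codebook):
    """Toy action-conditioned predictor in feature space:
    next latent = current latent + embedding of the chosen action."""
    return [[c + a for c, a in zip(cur, codebook[k])]
            for cur, k in zip(z_t, actions)]

def mse(pred, target):
    """Mean squared error over all entries of two equally shaped matrices."""
    total = sum(sq_dist(p, t) for p, t in zip(pred, target))
    return total / (len(pred) * len(pred[0]))

random.seed(0)
T, D, K = 8, 4, 3  # transitions, feature dimension, codebook size (all assumed)

# Stand-ins for frozen encoder features of T + 1 consecutive frames.
feats = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T + 1)]
# Hypothetical codebook of K action embeddings (fixed here, learned in practice).
codebook = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]

z_t, z_next = feats[:-1], feats[1:]
actions = quantize_actions(z_t, z_next, codebook)  # one discrete token per transition
pred = predict_next(z_t, actions, codebook)        # predicted next latents
loss = mse(pred, z_next)                           # latent-space prediction error
```

In this sketch the prediction error is computed on features rather than pixels, which mirrors the abstract's point about modeling transitions in feature space; the discrete `actions` sequence is the kind of signal that could be fused with appearance features for step recognition.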
Primary Subject Area: Application: Ophthalmology
Secondary Subject Area: Unsupervised Learning and Representation Learning
Registration Requirement: Yes
Visa & Travel: Yes
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 69