Unified Surgical World Model for Structured Understanding, Long-Horizon Prediction, and Fine-Grained Generation
Keywords: World Model; Unified Model; Healthcare; Surgical Intelligence
Abstract: World models learn environment dynamics and support long-horizon prediction and data-efficient policy learning by synthesizing plausible rollouts. They provide a flexible and powerful framework for training agents in environments where data is scarce, annotation is costly, and exploration is constrained. Surgery exemplifies all three constraints: the field of surgical intelligence lacks the high-quality, diverse multimodal data needed to train surgical vision-language models, as well as highly realistic simulators for training surgical robots.
Surgical world models address these challenges by both generating multimodal data and serving as an embodied surgical simulator, making them well suited to advancing surgical robotics and intelligence.
We propose a Unified Surgical World Model (UniSWM), which unifies structured understanding, long-horizon prediction, and fine-grained generation through a mixture of transformers. UniSWM acts as both a data generator and a simulator for surgical robotics, supporting vision–language and vision–language–action training across in-body and operating room settings. This model integrates structured understanding with discrete action tokens for phase, step, action, and movement, and supports long-horizon prediction for multi-step surgical trajectories. It conditions fine-grained generation on action and movement tokens, aligning frames to deterministic textual descriptions, and eliminates the need for optical flow or kinematic labels.
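The conditioning scheme described above can be illustrated with a minimal sketch. The vocabularies, labels, and token layout below are illustrative assumptions for exposition only, not the paper's actual implementation: each level of the surgical hierarchy (phase, step, action, movement) gets its own label set, and every label is mapped into one shared discrete-token id space so that a generator can attend to all four levels in a single conditioning sequence.

```python
# Hypothetical sketch of discrete action-token conditioning; the vocabularies
# and token layout are illustrative assumptions, not UniSWM's implementation.

# Separate label sets for each structured level of the surgical hierarchy.
VOCABS = {
    "phase":    ["preparation", "dissection", "clipping", "closure"],
    "step":     ["expose_triangle", "clip_artery", "cut_duct"],
    "action":   ["grasp", "retract", "cut", "coagulate"],
    "movement": ["up", "down", "left", "right", "forward", "back"],
}

# Assign each (level, label) pair a unique id in one shared token space,
# so all four levels can appear together in one conditioning sequence.
TOKEN_IDS = {}
for level, labels in VOCABS.items():
    for label in labels:
        TOKEN_IDS[(level, label)] = len(TOKEN_IDS)

def encode_condition(phase, step, action, movement):
    """Map one structured annotation to a sequence of discrete token ids
    that a generator could attend to alongside visual frame tokens."""
    return [
        TOKEN_IDS[("phase", phase)],
        TOKEN_IDS[("step", step)],
        TOKEN_IDS[("action", action)],
        TOKEN_IDS[("movement", movement)],
    ]

cond = encode_condition("dissection", "clip_artery", "grasp", "forward")
print(cond)  # one annotation -> four discrete conditioning tokens
```

Because the conditioning signal is a short sequence of discrete ids derived directly from textual annotations, no optical-flow or kinematic supervision is needed to specify it, matching the abstract's claim.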
To enable the training of world models, we introduce UniSWM-DB, a diverse multimodal dataset containing 1.81 million samples specifically designed for surgical training. To evaluate the capabilities of UniSWM, we propose UniSWM-Bench, a comprehensive benchmark covering five understanding tasks, two prediction tasks, and three generation tasks. Experimental results demonstrate that UniSWM significantly outperforms existing models, including GPT-5, Gemini-2.5-Pro, and Qwen-VL-Max, excelling in structured understanding, long-horizon prediction, and coherent, controllable visual generation.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23710