EndoVLA: Dual-Phase Vision-Language-Action for Precise Autonomous Tracking in Endoscopy

Published: 08 Aug 2025, Last Modified: 16 Sept 2025 | CoRL 2025 Poster | CC BY 4.0
Keywords: Vision–Language–Action, Continuum Robots, Autonomous Endoscopic Tracking, Reinforcement Learning
Abstract: In endoscopic procedures, autonomous tracking of abnormal regions and following of circumferential cutting markers can significantly reduce the cognitive burden on endoscopists. However, conventional model-based pipelines are fragile—each component (e.g., detection, motion planning) requires manual tuning and struggles to incorporate high-level endoscopic intent, resulting in poor generalization across variable scenes. Vision–Language–Action (VLA) models, which integrate visual perception, language grounding, and motion planning within an end-to-end framework, offer a promising alternative to semantically adapt to surgeon prompts, without the need for manual recalibration. Despite their potential, applying VLA models to robotic endoscopy presents unique challenges due to the inherently complex and dynamic anatomical environments of the gastrointestinal (GI) tract. To this end, we introduce EndoVLA, designed specifically for continuum robots in GI interventions. Provided endoscopic images and surgeon-issued tracking prompts, EndoVLA performs three core tasks: (1) polyp tracking, (2) delineation and following of abnormal mucosal regions, and (3) adherence to predefined circular markers during circumferential cutting. To address the unique challenges posed by data scarcity and domain shifts, we propose a dual-phase strategy, with supervised fine-tuning on our EndoVLA-Motion dataset and reinforcement fine-tuning using task-aware rewards. Our approach significantly enhances the tracking performance in endoscopy, and zero-shot generalization of tracking in general scenes and more challenging sequential tasks.
Submission Number: 589