MedScope: Incentivizing "Think with Videos" for Clinical Reasoning via Coarse-to-Fine Tool Calling

Published: 30 Apr 2026, Last Modified: 24 Jun 2026. Venue: ICML 2026 regular. Readers: Everyone. License: CC BY 4.0
TL;DR: We present MedScope, a tool-using medical video LMM that learns evidence-seeking for long videos and achieves SOTA with traceable visual grounding.
Abstract: Long-form clinical videos are central to visual evidence-based decision-making and are increasingly important in applications such as surgical robotics. However, current multimodal large language models typically process videos with passive sampling or weakly grounded inspection, which limits their ability to iteratively locate, verify, and justify predictions with temporally targeted evidence. To close this gap, we propose **MedScope**, a tool-using clinical video reasoning model that performs coarse-to-fine evidence seeking over long-form procedures. By interleaving intermediate reasoning with targeted tool calls and verification on retrieved observations, MedScope produces more accurate and trustworthy predictions that are explicitly grounded in temporally localized visual evidence. To address the lack of high-fidelity supervision, we build **ClinVideoSuite**, an evidence-centric, fine-grained clinical video suite. We then optimize **MedScope** with **G**rounding-**A**ware **G**roup **R**elative **P**olicy **O**ptimization (**GA-GRPO**), which directly reinforces tool use with grounding-aligned rewards and evidence-weighted advantages. On full-video and fine-grained video understanding benchmarks, **MedScope** achieves state-of-the-art performance in both in-domain and out-of-domain evaluations. Our approach points toward medical AI agents that can genuinely “think with videos” through tool-integrated reasoning. Code and resources are available at https://github.com/SII-WenjieLisjtu/MedScope.
Lay Summary: Long clinical videos, such as recordings of surgeries or endoscopic procedures, often contain important evidence that appears only briefly. Most current AI systems analyze these videos by sampling a few frames or clips, which means they may miss the key moment and still give an answer without clear visual support. We developed MedScope, an AI model that reviews clinical videos more like a human expert. It first looks broadly to understand the procedure, then searches specific time windows and checks individual frames before answering. To train and evaluate this behavior, we built ClinVideoSuite, a clinical video resource that links questions and answers to the visual evidence needed to support them. We also designed a training method that rewards the model not only for answering correctly, but also for finding the right visual evidence. Across multiple medical video benchmarks, MedScope outperformed existing open-source models and remained effective on videos from different sources. This work is a step toward medical AI systems that can provide more transparent, evidence-grounded support for clinical video understanding. However, such systems still require appropriate validation and human oversight before any real-world clinical use.
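The idea of rewarding both a correct answer and the right visual evidence can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the MedScope code: the function names, the use of temporal IoU as the grounding signal, the `weight` value, and the plain mean/std normalization (the standard group-relative form). The paper's evidence-weighted advantages are not reproduced here.

```python
from statistics import mean, pstdev


def temporal_iou(pred, gold):
    """Intersection-over-union of two (start, end) windows in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def grounded_reward(correct, pred_window, gold_window, weight=0.5):
    """Hypothetical reward: answer correctness plus a grounding bonus.

    `weight` is an illustrative hyperparameter, not a value from the paper.
    """
    return float(correct) + weight * temporal_iou(pred_window, gold_window)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one question whose evidence lies in 40-60 s.
gold = (40.0, 60.0)
rollouts = [
    (True, (42.0, 58.0)),   # correct answer, well-grounded window
    (True, (0.0, 10.0)),    # correct answer, wrong window
    (False, (40.0, 60.0)),  # wrong answer, right window
    (False, (0.0, 10.0)),   # wrong answer, wrong window
]
rewards = [grounded_reward(c, w, gold) for c, w in rollouts]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, the rollout that is both correct and well grounded receives the largest advantage, while a correct answer supported by the wrong window is ranked below it.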
Link To Code: https://github.com/SII-WenjieLisjtu/MedScope
Primary Area: Applications->Health / Medicine
Keywords: Large Vision-Language Models (LVLMs), Medical Video Reasoning, Visual Chain-of-Thought, Agentic Reinforcement Learning
Flagged For Ethics Review: true
Originally Submitted PDF: pdf
Submission Number: 2873