CatVLM: Enhancing Temporal Understanding in Cataract Surgery Videos with Boundary-Aware VLM

Jay Nitin Paranjape; Nisarg A Shah; Nanthini Narayanan; Shameema Sikder; S. Swaroop Vedula; Vishal M. Patel

27 Nov 2025 (modified: 15 Dec 2025), MIDL 2026 Conference Submission, Readers: Everyone, License: CC BY 4.0

Keywords: Cataract, Vision Language Models, Video Understanding

TL;DR: A Vision Language Model (VLM) for answering temporal queries about cataract surgery videos

Abstract: Recent studies have shown that Vision Language Models (VLMs) are effective at understanding and analyzing videos in the medical domain and at supporting various Question-Answer (QA) tasks. Yet current VLMs fall short on queries that require temporal reasoning, a critical capability for surgical video understanding. In this work, we introduce CatVLM, a boundary-aware VLM designed to capture temporal dynamics in untrimmed cataract surgery videos. CatVLM performs three clinically relevant tasks that demand moment-level awareness: Video Moment Retrieval (VMR), Video Captioning (VC), and Counting. To facilitate training such a model, we generate a bank of QA annotations for each task and propose a method to integrate video clips with the timestamps at which they occur. To the best of our knowledge, this work is one of the first approaches to explicitly incorporate temporal boundary awareness into VLMs for cataract surgery and, more broadly, for the medical domain. We evaluate CatVLM on two public cataract surgery datasets, establishing new baselines across all three tasks. All code, model checkpoints, and annotations will be released post-review.

Primary Subject Area: Application: Ophthalmology

Secondary Subject Area: Foundation Models

Registration Requirement: Yes

Visa & Travel: Yes

Read CFP & Author Instructions: Yes

Originality Policy: Yes

Single-blind & Not Under Review Elsewhere: Yes

LLM Policy: Yes

Submission Number: 77
