Keywords: Cataract, Vision Language Models, Video Understanding
TL;DR: A Vision Language Model (VLM) for answering temporal queries about cataract surgery videos
Abstract: Recent studies have shown the effectiveness of Vision Language Models (VLMs) for understanding and analyzing videos in the medical domain and for supporting various Question-Answering (QA) tasks. Yet, current VLMs fall short on queries that require temporal reasoning, a critical capability for surgical video understanding. In this work, we introduce CatVLM, a boundary-aware VLM designed to capture temporal dynamics in untrimmed cataract surgery videos. CatVLM performs three clinically relevant tasks that demand moment-level awareness: Video Moment Retrieval (VMR), Video Captioning (VC), and Counting. To facilitate training, we generate a bank of QA annotations for each task and propose a method to integrate video clips with the timestamps at which they occur. To the best of our knowledge, this work is among the first to explicitly incorporate temporal boundary awareness into VLMs for cataract surgery, and for the medical domain more broadly. We evaluate CatVLM on two public cataract surgery datasets, establishing new baselines across all three tasks. All code, model checkpoints, and annotations will be released post-review.
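To make the timestamp-integration idea concrete, the sketch below shows one plausible way to interleave per-clip visual placeholders with their time spans in the model prompt, so answers can be grounded to specific moments (as VMR and Counting require). This is a minimal illustration under our own assumptions; the Clip container, build_prompt helper, mm:ss format, and <video> placeholder are hypothetical and not CatVLM's actual interface.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Clip:
    start_s: float  # clip start time in seconds (hypothetical field name)
    end_s: float    # clip end time in seconds
    frames: list    # stand-in for decoded frames / visual features

def format_timestamp(t: float) -> str:
    """Render seconds as mm:ss, one common choice for surgical videos."""
    m, s = divmod(int(t), 60)
    return f"{m:02d}:{s:02d}"

def build_prompt(clips: List[Clip], question: str) -> str:
    """Interleave a <video> placeholder per clip with its time span so the
    language model sees where in the surgery each clip comes from."""
    parts = []
    for clip in clips:
        span = f"[{format_timestamp(clip.start_s)}-{format_timestamp(clip.end_s)}]"
        parts.append(f"{span} <video>")  # visual tokens would be substituted here
    parts.append(f"Question: {question}")
    return "\n".join(parts)

if __name__ == "__main__":
    clips = [Clip(0.0, 8.0, []), Clip(8.0, 16.0, [])]
    print(build_prompt(clips, "When does the capsulorhexis begin?"))
```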
Primary Subject Area: Application: Ophthalmology
Secondary Subject Area: Foundation Models
Registration Requirement: Yes
Visa & Travel: Yes
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 77