CatVLM: Enhancing Temporal Understanding in Cataract Surgery Videos with Boundary-Aware VLM

Published: 14 Feb 2026, Last Modified: 10 Apr 2026, MIDL 2026 Poster, CC BY 4.0
Keywords: Cataract, Vision Language Models, Video Understanding
TL;DR: A Vision Language Model (VLM) for answering temporal queries about cataract surgery videos
Abstract: Recent studies have shown that Vision Language Models (VLMs) are effective at understanding and analyzing medical videos and at supporting various Question-Answer (QA) tasks. Yet current VLMs fall short on queries that require temporal reasoning, a critical capability for surgical video understanding. In this work, we introduce CatVLM, a boundary-aware VLM designed to capture temporal dynamics in untrimmed cataract surgery videos. CatVLM performs three clinically relevant tasks that demand moment-level awareness: Video Moment Retrieval (VMR), Video Captioning (VC), and Counting. To facilitate training such a model, we generate a bank of QA annotations for each task and propose a method to integrate video clips with the timestamps at which they occur. To the best of our knowledge, this work is one of the first approaches to explicitly incorporate temporal boundary awareness into VLMs for cataract surgery and, more broadly, for the medical domain. We evaluate CatVLM on two public cataract surgery datasets, establishing new baselines across all three tasks. All code, model checkpoints, and annotations will be released post-review.
Primary Subject Area: Application: Ophthalmology
Secondary Subject Area: Foundation Models
Submission Number: 77