How Well Can General Vision-Language Models Learn Medicine By Watching Public Educational Videos?

Rahul Thapa; Andrew Li; Qingyang Wu; Bryan He; Yuki Sahashi; Christina Binder; Angela Zhang; Ben Athiwaratkun; Shuaiwen Leon Song; David Ouyang; James Zou

19 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0

Keywords: vision-language models, biomedicine, datasets, evaluations

TL;DR: Instruction-tuning Qwen-2-VL on 1,031 hours of pedagogical biomedical videos substantially improves video and image understanding. We also introduce two new expert-curated benchmarks and release all data and code.

Abstract: Publicly available biomedical videos, such as those on YouTube, serve as valuable educational resources for medical students. Unlike standard machine learning datasets, these videos are designed for human learners, often mixing medical imagery with narration, explanatory diagrams, and contextual framing. In this work, we investigate whether such pedagogically rich yet non-standardized and heterogeneous videos can effectively teach general-domain vision-language models biomedical knowledge. To this end, we introduce OpenBiomedVid, a biomedical video instruction-tuning dataset comprising 1,031 hours of video-caption and Q/A pairs, curated through a multi-step human-in-the-loop pipeline. Diverse biomedical video datasets are rare, and OpenBiomedVid fills an important gap by providing instruction-style supervision grounded in real-world educational content. Surprisingly, despite the informal and heterogeneous nature of these videos, the fine-tuned Qwen-2-VL models exhibit substantial performance improvements across most benchmarks. Relative to their respective base models, the 2B model achieves gains of 98.7% on video tasks, 71.2% on image tasks, and 0.2% on text tasks, while the 7B model improves by 37.09% on video tasks and 11.2% on image tasks, with a slight degradation of 2.7% on text tasks. To address the lack of standardized biomedical video evaluation datasets, we also introduce two new expert-curated benchmarks, MIMICEchoQA and SurgeryVideoQA. On these benchmarks, the 2B model achieves gains of 99.1% and 98.1%, while the 7B model shows gains of 22.5% and 52.1%, respectively, demonstrating the models' ability to generalize and perform biomedical video understanding on cleaner and more standardized datasets than those seen during training. These results suggest that educational videos created for human learning offer a surprisingly effective training signal for biomedical VLMs. We release OpenBiomedVid, MIMICEchoQA, SurgeryVideoQA, the fine-tuned models, and the complete codebase to support future research.

Primary Area: datasets and benchmarks

Submission Number: 22059
