No Tokens Wasted: Leveraging Long Context in Biomedical Vision–Language Models

Published: 27 Nov 2025, Last Modified: 28 Nov 2025 · ML4H 2025 Poster · CC BY 4.0
Keywords: Biomedical Vision-Language Models, Long-context Modeling, Contrastive Learning
Track: Findings
Abstract: Embedding vision–language models (VLMs) are typically pretrained with short text windows ($<$77 tokens), which forces the truncation of long-format captions. Yet the distribution of biomedical captions in large-scale open-source literature shows that a substantial portion of captions far exceeds 77 tokens. We therefore investigate the impact of pretraining on long-format biomedical captions by extending the context length of the text encoders in VLMs. We find that longer context (and thus the additional supervision provided by long-format captions) correlates with better retrieval and classification performance. Building on this finding, we introduce BIOMEDICA-LongCAP, a dataset of 1M image–caption pairs enriched with context-aware descriptions from full-text articles, providing longer and richer textual supervision. Using BIOMEDICA-LongCAP, we train BMC-LongCLIP, a long-context biomedical VLM whose text encoder supports windows of up to 512 tokens. Our model extends context capacity by 6.6×, reducing token waste from 55\% to just 2.2\%. On long-caption retrieval benchmarks, BMC-LongCLIP achieves up to +30\% absolute gains in Recall@1 and +2\% average improvements in classification, while also converging faster than short-context baselines. Our results demonstrate that long-context modeling is a promising direction for advancing biomedical VLMs.
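
To illustrate what extending a text encoder's context window involves, the minimal PyTorch sketch below stretches a CLIP-style learned positional-embedding table from 77 to 512 positions by linear interpolation. The function name `extend_positional_embeddings` and the interpolation strategy are illustrative assumptions, not the exact procedure used to train BMC-LongCLIP; see the code repository linked below for the actual implementation.

```python
import torch
import torch.nn.functional as F

def extend_positional_embeddings(pos_emb: torch.Tensor, new_len: int) -> torch.Tensor:
    """Stretch a learned (old_len, dim) positional-embedding table to (new_len, dim)
    by linear interpolation along the sequence axis."""
    old_len, dim = pos_emb.shape
    # F.interpolate expects (batch, channels, length), so treat dim as channels.
    resized = F.interpolate(
        pos_emb.T.unsqueeze(0),          # (1, dim, old_len)
        size=new_len,
        mode="linear",
        align_corners=True,
    )
    return resized.squeeze(0).T          # (new_len, dim)

# Toy example with CLIP-like shapes: a 77-token window and 512-dim text embeddings.
old_pos = torch.randn(77, 512)
new_pos = extend_positional_embeddings(old_pos, new_len=512)
print(old_pos.shape, "->", new_pos.shape)  # torch.Size([77, 512]) -> torch.Size([512, 512])
```

Interpolating the existing table is one common way to warm-start a longer context window (it preserves the relative ordering of learned positions) before continuing contrastive pretraining on long-format captions.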
General Area: Applications and Practice
Specific Subject Areas: Foundation Models, Medical Imaging, Representation Learning
Data And Code Availability: Yes
Ethics Board Approval: No
Entered Conflicts: I confirm the above
Anonymity: I confirm the above
Code URL: https://github.com/minwoosun/open_clip_bmc
Submission Number: 256