ERA: Evidence-Based Reasoning and Augmentation for Open-Vocabulary Medical Vision

ICLR 2026 Conference Submission 25087 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Vision-Language Models (VLMs), Retrieval-Augmented Generation (RAG), Chain-of-Thought (CoT), Open-Vocabulary Medical Imaging (OVMI), Segment Anything Model 2 (SAM2)
TL;DR: We introduce ERA, a framework that forces medical Vision-Language Models to reason based on retrieved evidence instead of just guessing. This training-free approach achieves reliable, expert-level performance.
Abstract: Vision-Language Models (VLMs) have shown great potential in open-vocabulary medical imaging tasks. However, their reliance on implicit correlations rather than explicit evidence leads to unreliable localization and opaque reasoning. To address these challenges, we introduce ERA (Evidence-Based Reasoning and Augmentation), a novel framework that transforms VLMs from implicit guessers into explicit reasoners for medical imaging. ERA leverages Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) prompting to construct a traceable reasoning path from evidence to result. The framework requires no additional training and can be applied on top of any existing Vision-Language Model. Evaluated across multiple challenging medical imaging benchmarks, ERA performs comparably to fully supervised specialist models and significantly surpasses current open-vocabulary baselines. ERA thus provides an effective pathway toward reliable clinical Vision-Language Models.
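To make the evidence-to-result path concrete, below is a minimal sketch of what a training-free RAG + CoT wrapper around a frozen VLM could look like. The function and parameter names (`retrieve`, `vlm`, `era_style_query`, the prompt wording) are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of a training-free RAG + CoT wrapper around a frozen VLM,
# illustrating the evidence-grounded reasoning path the abstract describes.
# `retrieve`, `vlm`, and the prompt template are assumptions for illustration.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Evidence:
    source: str  # e.g. the reference/atlas entry the snippet was retrieved from
    text: str    # the retrieved description used as explicit evidence


def era_style_query(
    image: object,                                 # decoded medical image, in whatever form the VLM accepts
    target: str,                                   # open-vocabulary term, e.g. "pleural effusion"
    retrieve: Callable[[str], List[Evidence]],     # RAG step: term -> ranked evidence snippets
    vlm: Callable[[object, str], str],             # frozen VLM: (image, prompt) -> answer
    k: int = 3,
) -> str:
    """Build a chain-of-thought prompt grounded in retrieved evidence, then
    query the unmodified VLM so each reasoning step cites that evidence."""
    evidence = retrieve(target)[:k]
    cited = "\n".join(
        f"[{i + 1}] ({e.source}) {e.text}" for i, e in enumerate(evidence)
    )
    prompt = (
        f"Task: locate and describe '{target}' in the image.\n"
        f"Evidence:\n{cited}\n"
        "Reason step by step, citing the numbered evidence above at each step. "
        "If no evidence supports a finding, say so rather than guessing."
    )
    return vlm(image, prompt)
```

Because the VLM is called only at inference time with an augmented prompt, this kind of wrapper needs no fine-tuning and can sit on top of any existing model, which is the training-free property the abstract claims.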
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 25087