Keywords: Medical vision-language models, Retrieval-augmented generation, Adaptive retrieval, Quality-aware context fusion, Factual accuracy
TL;DR: DynaRAG, a dynamic multimodal RAG framework that adaptively selects, weights, and gates retrieved evidence to improve the factual accuracy of medical vision-language models (cross-submission work).
Abstract: Medical Large Vision Language Models (Med-LVLMs) have advanced automated diagnosis but still generate factually inaccurate responses, a critical flaw in clinical settings. Retrieval-Augmented Generation (RAG) offers a remedy through external knowledge, yet its use in medicine introduces two core challenges: static retrieval strategies that cannot adaptively trade off context coverage against noise, and over-reliance on retrieved contexts, which harms performance even when the model's intrinsic knowledge is correct. To overcome these, we propose DynaRAG (Dynamic Retrieval-Augmented Generation), a novel framework that reimagines multimodal RAG with three synergistic innovations: a Gaussian Mixture Model-based Adaptive Top-K Selection mechanism that replaces heuristic thresholding with probabilistic filtering; a Quality-Aware Context Fusion module that dynamically weights retrieved references using both data-driven confidence and learned utility; and an Adaptive Attention Modulation gate that balances internal knowledge with external evidence during generation. These components are unified under an end-to-end trainable objective that jointly optimizes retrieval, fusion, and generation. Extensive experiments across three medical VQA and report generation benchmarks demonstrate that DynaRAG achieves state-of-the-art performance, improving factual accuracy by an average of 47.4% over strong baselines while significantly mitigating over-reliance on retrieved contexts.
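To make the Adaptive Top-K Selection idea concrete, the following is a minimal Python sketch of GMM-based probabilistic score filtering: it fits a two-component Gaussian mixture to a query's retrieval similarity scores and keeps the items assigned to the higher-mean component, so K adapts to each query's score distribution instead of using a fixed cutoff. The function name `adaptive_top_k`, the `keep_prob` threshold, and the use of scikit-learn are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of GMM-based adaptive top-K selection over
# retrieval similarity scores; names and defaults are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def adaptive_top_k(scores: np.ndarray, keep_prob: float = 0.5) -> np.ndarray:
    """Return indices of retrieved items assigned to the high-score mode.

    Fits a 2-component Gaussian mixture to the similarity scores and
    keeps items whose posterior probability of belonging to the
    higher-mean ("relevant") component exceeds `keep_prob`, so the
    number of kept references varies per query.
    """
    x = scores.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
    relevant = int(np.argmax(gmm.means_.ravel()))  # higher-mean component
    posterior = gmm.predict_proba(x)[:, relevant]
    return np.nonzero(posterior > keep_prob)[0]

# Example: scores with a clear relevant/irrelevant split keep the top 3.
scores = np.array([0.91, 0.88, 0.85, 0.31, 0.28, 0.22, 0.19])
print(adaptive_top_k(scores))  # e.g. [0 1 2]
```

The fusion and gating components described in the abstract would then operate downstream of such a filter, weighting the kept references and blending them with the model's internal knowledge; their exact parameterization is not specified here.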
Submission Number: 27