Keywords: Medical visual question answering, Reasoning trajectory clustering, Self-improvement learning
TL;DR: We enhance medical VQA through COMCTS-generated reasoning annotations and a self-improvement framework that filters generated reasoning paths via DTW-based trajectory clustering.
Abstract: While large language models have shown promise in medical applications, their performance in medical visual question answering (VQA) remains limited by insufficient vision-language reasoning capabilities. We address this challenge through two complementary approaches. First, we generate high-quality reasoning annotations for existing medical VQA datasets using the COMCTS algorithm. Second, we introduce a self-improvement framework that bootstraps model performance by learning from the model's own outputs, guided by a small set of high-quality reasoning samples. To optimize this self-improvement process, we propose a novel filtering mechanism based on K-medoids clustering of reasoning trajectories, which employs Dynamic Time Warping (DTW) distances to select the most effective generated reasoning paths. Our approach yields significant improvements on medical VQA tasks. We release both the COMCTS-generated reasoning datasets and our code to support future research. Our code is available at https://anonymous.4open.science/r/SelfImproving-MedicalVQA-5507
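As a rough sketch of the filtering idea described in the abstract (not the paper's implementation; representing each reasoning trajectory as a sequence of per-step scalar scores is an assumption made here for illustration), K-medoids clustering under DTW distance can be written as:

```python
import random

def dtw(a, b):
    """Dynamic Time Warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def k_medoids(seqs, k, iters=20, seed=0):
    """Naive K-medoids over variable-length sequences using DTW distance."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(seqs)), k)
    clusters = {}
    for _ in range(iters):
        # Assign each trajectory to its nearest medoid under DTW.
        clusters = {m: [] for m in medoids}
        for i, s in enumerate(seqs):
            nearest = min(medoids, key=lambda m: dtw(s, seqs[m]))
            clusters[nearest].append(i)
        # Re-pick each medoid as the member minimizing total in-cluster DTW cost.
        new_medoids = [
            min(members, key=lambda c: sum(dtw(seqs[c], seqs[o]) for o in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return medoids, clusters

# Hypothetical trajectories: two tight groups of per-step scores.
trajectories = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [5.0, 5.0, 5.0], [5.1, 5.0, 4.9]]
medoids, clusters = k_medoids(trajectories, k=2)
```

The medoid of each cluster is itself a generated reasoning path, so selecting medoids (rather than centroids, which DTW does not define) directly yields representative trajectories for the self-improvement loop.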
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9134