Track: Regular papers (within 8 pages excluding appendix)
Keywords: medical visual reasoning, visual question answering, explainable AI, multimodal learning, prompting framework
Abstract: Reasoning transparency and accuracy are critical to the deployment of AI algorithms in medical applications. However, modern medical vision-language models (VLMs) often generate conclusions without explicit reasoning, limiting clinician trust and potentially compromising diagnostic quality. Reasoning-focused VLMs remain confined to general-domain VQA datasets (e.g., A-OKVQA), while medical VLMs lack reasoning transparency, modularity, and output-refinement capabilities. We introduce Medical Visual Chain-of-Thought Processing (MedVCTP), a training-free framework implementing a structured See–Think–Confirm pipeline. The See stage extracts global and regional visual concepts via advanced visual encoders. The Think stage generates reasoning-grounded answers through LLM-based chain-of-thought processing. The Confirm stage iteratively refines rationales via multi-shot prompting and cross-modal CLIP-based consistency checks, aligning reasoning with visual context to mitigate hallucination and enable visual grounding. Our modular design supports rapid deployment with interchangeable components for scalable performance. On SLAKE, MedVCTP achieves 85.8% accuracy, a 19.4% improvement over an ablation without CLIP refinement, demonstrating that iterative cross-modal validation directly enhances both accuracy and reasoning coherence. These results establish MedVCTP as a step toward reliable, explainable medical visual reasoning systems deployable without task-specific training. Code and artifacts are available at https://github.com/Carrote-s/MedVCTP.
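The Confirm stage's cross-modal consistency check, as described in the abstract, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the CLIP image/text encoders are stubbed with placeholder embedding vectors, and the function name `confirm_rationale` and the similarity threshold are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def confirm_rationale(image_emb, rationale_embs, threshold=0.3):
    """Split rationale sentences into those sufficiently aligned with the
    image embedding (kept) and those below threshold (flagged for
    regeneration in an iterative refinement loop).

    `threshold` is a hypothetical tuning parameter, not a value from the paper.
    """
    keep, regenerate = [], []
    for i, emb in enumerate(rationale_embs):
        if cosine_similarity(image_emb, emb) >= threshold:
            keep.append(i)
        else:
            regenerate.append(i)
    return keep, regenerate


# Placeholder embeddings standing in for CLIP image/text encoder outputs.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
aligned_sentence = image_emb + 0.1 * rng.normal(size=512)  # close to the image
unrelated_sentence = rng.normal(size=512)                  # independent of it
keep, regen = confirm_rationale(image_emb, [aligned_sentence, unrelated_sentence])
```

In a real pipeline the placeholder vectors would come from a CLIP image encoder and text encoder, and flagged sentences would be fed back to the LLM for another refinement round.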
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 19