CoVT-CXR: Building Chain of Visual Thought for Interpretable Chest X-Ray Diagnosis

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: chain of visual thought, multimodal understanding, fine-grained dataset, medical report generation, interpretable LLM
Abstract: Although clinical report generation has the potential to streamline radiologist workflows and benefit under-served regions, automated radiograph analysis still suffers from an uninterpretable reasoning process and inaccurate results. To this end, we propose a novel Chain-of-Visual-Thought (CoVT) that emulates doctors' multi-modal reasoning, enabling more interpretable and accurate CXR diagnostic predictions through explicit multi-step intermediate guidance. Specifically, we mimic doctors' multi-modal, multi-step reasoning procedure by breaking clinical reports down into individual descriptions and grounding each rationale in corresponding visual prompts, such as masks, landmarks, linestrips, and bounding boxes, to make the visual reasoning behind radiographs explicit. By further dividing this association into cross-modal sub-tasks, CoVT can exploit a multi-stage fine-tuning protocol to gradually develop its chain-of-reasoning capability. To support this approach, we introduce CoVT-CXR, the first detail-aligned, multi-step cross-modal dataset for diagnostic tasks, featuring about 3M instruction-following data points for pretraining and around 30K reasoning sequences for fine-tuning, sourced from 6K patient cases and annotated by 32 medical trainees using our tailored tool. CoVT-CXR covers more than 20 diseases, with diagnoses requiring 1 to 12 reasoning steps. Through a series of experiments on CoVT-CXR, we demonstrate the advantages of the CoVT method over baseline approaches, validate the quality of our annotated data, and highlight the positive impact of CoVT-CXR on various clinically related tasks. Our CoVT model, annotation tool, and CoVT-CXR dataset will be made fully available upon acceptance.
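To make the data format concrete: the abstract does not specify CoVT-CXR's record schema, so the sketch below is only a rough illustration of what one reasoning sequence (a textual rationale per step, grounded in visual prompts such as masks, landmarks, linestrips, or bounding boxes) might look like. All class names, field names, and coordinates here are assumptions, not the authors' actual format.

```python
# Hypothetical sketch of one CoVT-CXR reasoning sequence. The real schema is
# not given in the abstract; every name and value below is illustrative only.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VisualPrompt:
    """One visual grounding for a reasoning step."""
    kind: str                      # "mask" | "landmark" | "linestrip" | "bbox"
    points: List[Tuple[int, int]]  # pixel coordinates; meaning depends on kind

@dataclass
class ReasoningStep:
    """One textual rationale paired with its visual evidence."""
    rationale: str
    prompts: List[VisualPrompt] = field(default_factory=list)

@dataclass
class CoVTSequence:
    """A full chain of visual thought for one patient case (1 to 12 steps)."""
    image_path: str
    steps: List[ReasoningStep]
    diagnosis: str

# Example: a two-step chain ending in a diagnosis (placeholder content).
example = CoVTSequence(
    image_path="cxr_0001.png",
    steps=[
        ReasoningStep(
            rationale="The right costophrenic angle is blunted.",
            prompts=[VisualPrompt(kind="bbox", points=[(410, 620), (560, 740)])],
        ),
        ReasoningStep(
            rationale="A meniscus-shaped opacity tracks along the lateral chest wall.",
            prompts=[VisualPrompt(kind="linestrip",
                                  points=[(420, 700), (480, 660), (550, 640)])],
        ),
    ],
    diagnosis="Right-sided pleural effusion",
)

assert 1 <= len(example.steps) <= 12  # abstract: diagnoses take 1 to 12 steps
```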
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9886