Keywords: Multimodality, Medical image analysis, Report generation
Abstract: Multi-phase 3D contrast-enhanced imaging is indispensable for clinical diagnosis, yet current vision–language models (VLMs) inadequately capture temporal dynamics across imaging phases, which limits their reliability in automated medical report generation. We propose the Phase-aware Memory Thought (PhoT) framework, which integrates temporal progression patterns in multi-phase CT with structured clinical reasoning. PhoT incorporates: (i) phase-aware pretraining to learn temporally aligned visual representations; (ii) parameter-efficient fine-tuning to adapt these representations for report generation; and (iii) a structured inference mechanism (“Phase of Thought”) that leverages diagnostic templates to enhance clinical fidelity. We curate a large-scale dataset comprising 12,230 multi-phase CT series from 61,332 patient cases and evaluate PhoT on it. Experimental results show that PhoT consistently outperforms strong baselines in both retrieval and report generation, with higher accuracy and improved interpretability. This work establishes PhoT as a clinically grounded, temporally aware VLM for automated diagnostic reporting in complex medical imaging scenarios.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 15358