Keywords: Medical Report Generation, Human-AI Collaboration
Abstract: We introduce the first bidirectional collaborative medical report generation framework, designed to reduce physicians' workload and enhance trustworthiness through targeted physician-AI interaction: physicians provide feedback only on the most critical parts of a report, and the Vision-Language Model (VLM) propagates these corrections to finalize the full report. The core challenge lies in defining the optimal unit of interaction. We propose the Anatomy-Finding Concept Unit (AFCU), a minimal, clinically grounded semantic statement (e.g., ``left lobe: hypoechoic nodule'') satisfying three key principles: atomicity, lack of ambiguity, and anatomical anchoring. To extract AFCUs, we use a Large Language Model (LLM) guided by predefined clinical templates, followed by information bottleneck clustering that groups lexically diverse but semantically equivalent anatomical concepts (e.g., mapping ``left and right lobe'' to ``both lobes of the thyroid gland''), eliminating redundancy while preserving diagnostic fidelity. To prioritize physician intervention, we introduce the Concept Risk Score (CRS), which quantifies behavioral inconsistency (concepts generated regardless of image content) and semantic instability (inconsistent associated findings under image perturbations) via occlusion-based visual grounding. Finally, we propose Holistic Semantic Match (HSM), a concept-based metric that correlates strongly with human judgment (Pearson's $r = 0.846$, $p < 0.05$). Experiments across four organs show that correcting only one high-error-risk AFCU per report, a minimal and clinically feasible intervention, improves semantic quality by 9.13\% in HSM, enabling efficient and trustworthy physician-AI collaboration.
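For illustration only, below is a minimal Python sketch of what a concept-based match over AFCU sets could look like. The abstract does not specify the HSM formula, so the F1-style overlap, the `concept_match` function, and the example concept units here are assumptions used solely to convey the idea of scoring reports by shared (anatomy, finding) units rather than surface n-grams.

```python
# Hypothetical sketch of a concept-level semantic match over AFCU sets.
# This is NOT the paper's HSM definition; it is an assumed F1-style overlap
# used only to illustrate comparing reports by (anatomy, finding) units.

def concept_match(generated: set[tuple[str, str]],
                  reference: set[tuple[str, str]]) -> float:
    """Return an F1-style overlap between two sets of (anatomy, finding) units."""
    if not generated and not reference:
        return 1.0
    overlap = len(generated & reference)
    precision = overlap / len(generated) if generated else 0.0
    recall = overlap / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Example AFCUs (hypothetical) for a generated and a reference thyroid report.
    gen = {("left lobe", "hypoechoic nodule"), ("isthmus", "normal thickness")}
    ref = {("left lobe", "hypoechoic nodule"), ("right lobe", "cystic nodule")}
    print(f"concept-level match: {concept_match(gen, ref):.3f}")
```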
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4639