Cite-While-You-Generate: Training-Free Evidence Attribution for Multimodal Clinical Summarization

ICLR 2026 Conference Submission 15633 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Source attribution, Multimodal language models, Trustworthy AI

TL;DR: We introduce a training-free framework that generates clinical summaries with real-time citations to text and images, improving transparency and trust in multimodal AI.

Abstract: Trustworthy clinical summarization requires not only fluent generation but also transparency about where each statement comes from. We propose a training-free framework for generation-time source attribution that leverages decoder attention to directly cite supporting text spans or images, overcoming the limitations of post-hoc and retraining-based methods. We introduce two strategies for multimodal attribution: a raw image mode, which directly uses image patch attentions, and a caption-as-span mode, which substitutes images with generated captions to enable purely text-based alignment. Evaluations on two representative domains, clinician-patient dialogues (CliConSummation) and radiology reports (MIMIC-CXR), show that our approach consistently outperforms embedding-based and self-attribution baselines, improving both text-level and multimodal attribution accuracy (e.g., +15% F1 over embedding baselines). Caption-based attribution performs competitively with raw-image attention while being more lightweight and practical. These findings highlight attention-guided attribution as a promising step toward interpretable and deployable clinical summarization systems.
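
The general idea of citing sources from decoder attention can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `attribute_sentence`, the averaging rule, and the toy attention values are all assumptions made for illustration. It assumes an attention matrix is already available for the tokens of one generated sentence, and that each candidate source (a text span, or the tokens standing in for an image, such as a caption) occupies a known range of source positions.

```python
from typing import Dict, List, Tuple


def attribute_sentence(
    attn: List[List[float]],
    spans: Dict[str, Tuple[int, int]],
    top_k: int = 1,
) -> List[str]:
    """Rank source spans by the attention mass a generated sentence puts on them.

    attn[i][j] is the (hypothetical) attention weight from generated token i
    of the sentence to source token j. spans maps a source id to a half-open
    [start, end) range of source token positions.
    """
    if not attn:
        return []
    scores: Dict[str, float] = {}
    for span_id, (start, end) in spans.items():
        # Sum attention falling inside the span, averaged over generated tokens.
        total = sum(sum(row[start:end]) for row in attn)
        scores[span_id] = total / len(attn)
    # Highest-scoring spans become the citations for this sentence.
    ranked = sorted(scores, key=lambda s: scores[s], reverse=True)
    return ranked[:top_k]


# Toy example: two generated tokens, four source tokens, two source spans.
toy_attn = [
    [0.10, 0.10, 0.60, 0.20],
    [0.05, 0.15, 0.50, 0.30],
]
toy_spans = {"S1": (0, 2), "S2": (2, 4)}
citations = attribute_sentence(toy_attn, toy_spans)
```

In this toy case the sentence places most of its attention on positions 2 and 3, so `S2` is selected as the citation. A real system would need to decide which layers and heads to aggregate and whether to normalize by span length; those choices are left open here.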

Supplementary Material: zip

Primary Area: interpretability and explainable AI

Submission Number: 15633
