Toward Interpretable 3D Diffusion in Radiology: Token-Wise Attribution for Text-to-CT Synthesis

10 Apr 2025 (modified: 12 Apr 2025) · MIDL 2025 Short Papers Submission · CC BY 4.0
Keywords: 3D Medical Image Synthesis, Text-to-Image Diffusion, Explainable AI (XAI)
TL;DR: We introduce a lightweight token-wise attribution method for 3D diffusion models that reveals how individual text tokens influence CT generation, addressing key limitations of current explainability techniques in high-dimensional medical imaging.
Abstract: Diffusion-based generative models have emerged as powerful tools for synthesizing anatomically realistic computed tomography (CT) scans from free-text prompts, but they remain opaque about how individual prompt tokens influence the generated CT volume. This lack of interpretability limits their clinical applicability, trustworthiness, and adoption in diagnostic and decision-support scenarios. We present a token-wise voxel attribution method for 3D text-to-image diffusion models that leverages the cross-attention layers of U-Net–based architectures to extract per-token attention maps for synthetic CT scans. Our method visualizes individual, joint, or aggregated token-level voxel attributions during CT synthesis, helping to address concerns about model transparency. This lays the groundwork for practical methods and structured explanations of what attribution captures well, where current limitations lie, and how researchers might approach explainable AI for 3D text-to-image diffusion models in radiology going forward.
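The cross-attention-based attribution described in the abstract can be sketched roughly as follows. This is a minimal illustration only, assuming a PyTorch 3D diffusion U-Net whose cross-attention modules expose their last attention probabilities (here via a hypothetical `attn_probs` attribute); the function and attribute names are assumptions for illustration, not the authors' released code.

```python
# Hypothetical sketch: capture cross-attention weights from a 3D diffusion
# U-Net and aggregate them into per-token voxel attribution maps.
import torch
import torch.nn.functional as F

def collect_token_attributions(unet, cross_attn_modules, latents, timesteps,
                               text_embeddings, num_tokens, volume_shape):
    """Run one denoising step and return per-token voxel attribution maps."""
    captured = []  # attention probabilities from each cross-attention layer

    def hook(module, inputs, output):
        # Assumes each cross-attention module stores its last attention
        # probabilities as `module.attn_probs` with shape
        # (batch * heads, voxels, text_tokens).
        captured.append(module.attn_probs.detach())

    handles = [m.register_forward_hook(hook) for m in cross_attn_modules]
    with torch.no_grad():
        unet(latents, timesteps, encoder_hidden_states=text_embeddings)
    for h in handles:
        h.remove()

    # Average over heads, upsample each layer's map to the full CT volume
    # resolution, and accumulate per token across layers.
    token_maps = torch.zeros(num_tokens, *volume_shape)
    for attn in captured:
        _, voxels, tokens = attn.shape
        side = round(voxels ** (1 / 3))          # assumes a cubic latent grid
        maps = attn.mean(dim=0)                  # (voxels, tokens)
        maps = maps.permute(1, 0).reshape(tokens, 1, side, side, side)
        maps = F.interpolate(maps, size=volume_shape, mode="trilinear",
                             align_corners=False)
        token_maps += maps[:num_tokens, 0]
    return token_maps / max(len(captured), 1)    # average over layers
```

In practice, such maps could be collected at one or several denoising timesteps and visualized per token, summed over selected tokens, or aggregated over the whole prompt, matching the individual, joint, and aggregated views mentioned above.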
Submission Number: 38
