Toward Interpretable 3D Diffusion in Radiology: Token-Wise Attribution for Text-to-CT Synthesis

10 Apr 2025 (modified: 12 Apr 2025) · MIDL 2025 Short Papers Submission · CC BY 4.0
Keywords: 3D Medical Image Synthesis, Text-to-Image Diffusion, Explainable AI (XAI)
TL;DR: We introduce a lightweight token-wise attribution method for 3D diffusion models that reveals how individual text tokens influence CT generation, addressing key limitations of current explainability techniques in high-dimensional medical imaging.
Abstract: Diffusion-based generative models have emerged as powerful tools for synthesizing anatomically realistic computed tomography (CT) scans from free-text prompts, but they remain opaque about how individual prompt tokens influence the generated CT volume. This lack of interpretability limits their clinical applicability, trustworthiness, and adoption in diagnostic and decision-support scenarios. We present a token-wise voxel attribution method for 3D text-to-image diffusion models that leverages the cross-attention layers of U-Net–based architectures to extract per-token attention maps for synthetic CT scans. Our method visualizes individual, joint, or aggregated token-level voxel attributions during CT synthesis, helping to address concerns about model transparency. This lays the groundwork for practical methods and structured explanations of what attribution captures well, where current limitations lie, and how researchers might approach explainable AI for 3D text-to-image diffusion models in radiology going forward.
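The cross-attention-based attribution described in the abstract can be sketched roughly as follows. This is a minimal illustration only, assuming a PyTorch 3D diffusion U-Net whose cross-attention modules expose their last attention probabilities (here via a hypothetical `attn_probs` attribute); the function and attribute names are assumptions for illustration, not the authors' released code.

```python
# Hypothetical sketch: capture cross-attention weights from a 3D diffusion
# U-Net and aggregate them into per-token voxel attribution maps.
import torch
import torch.nn.functional as F

def collect_token_attributions(unet, cross_attn_modules, latents, timesteps,
                               text_embeddings, num_tokens, volume_shape):
    """Run one denoising step and return per-token voxel attribution maps."""
    captured = []  # attention probabilities from each cross-attention layer

    def hook(module, inputs, output):
        # Assumes each cross-attention module stores its last attention
        # probabilities as `module.attn_probs` with shape
        # (batch * heads, voxels, text_tokens).
        captured.append(module.attn_probs.detach())

    handles = [m.register_forward_hook(hook) for m in cross_attn_modules]
    with torch.no_grad():
        unet(latents, timesteps, encoder_hidden_states=text_embeddings)
    for h in handles:
        h.remove()

    # Average over heads, upsample each layer's map to the full CT volume
    # resolution, and accumulate per token across layers.
    token_maps = torch.zeros(num_tokens, *volume_shape)
    for attn in captured:
        _, voxels, tokens = attn.shape
        side = round(voxels ** (1 / 3))          # assumes a cubic latent grid
        maps = attn.mean(dim=0)                  # (voxels, tokens)
        maps = maps.permute(1, 0).reshape(tokens, 1, side, side, side)
        maps = F.interpolate(maps, size=volume_shape, mode="trilinear",
                             align_corners=False)
        token_maps += maps[:num_tokens, 0]
    return token_maps / max(len(captured), 1)    # average over layers
```

In practice, such maps could be collected at one or several denoising timesteps and visualized per token, summed over selected tokens, or aggregated over the whole prompt, matching the individual, joint, and aggregated views mentioned above.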
Submission Number: 38
