Abstract: Existing radiology report generation (RRG) studies mostly adopt autoregressive (AR) approaches that produce textual descriptions token by token for a given clinical radiograph. These approaches are susceptible to error propagation: once irrelevant content is generated partway through a report, it can compromise the presentation of precise diagnoses, especially when the radiograph contains complicated abnormalities. Although the non-AR paradigm, e.g., diffusion models, offers an alternative that avoids this AR problem by generating all content in parallel, the use of Gaussian noise in existing diffusion models still leaves significant room for improvement in particular circumstances, i.e., in providing proper guidance to control the noise in the diffusion process so as to ensure precise report generation. In this paper, we propose to conduct RRG with diffusion networks by controlling the noise with task-specific features. Our approach leverages irrelevant visual and textual information as noise rather than stochastic Gaussian noise, and allows the diffusion networks to filter particular information through iterative denoising, thus performing a precise and controlled report generation process. Experiments on IU X-Ray and MIMIC-CXR demonstrate the superiority of our approach compared to strong baselines and state-of-the-art solutions. Human evaluation and noise type analysis show that comprehensive noise control greatly helps diffusion networks refine the generation of global and local report contents.
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: This paper designs a new approach for the radiology report generation task, which aims to generate text (i.e., findings) from a given medical image (i.e., a radiograph). The paper falls under the "Vision and Language" topic of the "Multimedia Content Understanding" theme and is therefore within the scope of the ACM Multimedia conference.
Submission Number: 4178