Learning Joint General and Specific Representation with Masked Auto-Encoder for Radiology Report Generation

Published: 2025 (ICANN (4) 2025). Last Modified: 08 Jan 2026. License: CC BY-SA 4.0.
Abstract: As an essential cross-modal task, radiology report generation (RRG) has drawn much attention recently. Existing approaches typically rely on large-scale annotated data (coupled image-report pairs) to generate satisfactory radiology reports. However, building such large-scale annotated datasets is time-consuming and labor-intensive, while the data demands of ever-growing models keep increasing. Thanks to the masked auto-encoder's (MAE) global understanding and effective exploitation of unlabeled data, we attempt to extend MAE to learn both global and specific features for RRG, possibly incorporating unlabeled image information. In this way, we can alleviate the annotation burden and achieve competitive performance with less labeled data. We therefore propose a novel approach, named joint MAE (JMAE), to simultaneously mine global and specific representations for RRG. Moreover, we inject available unlabeled chest X-ray images to augment our joint model. Extensive experiments on two widely used datasets demonstrate that our approach outperforms many representative baselines, and detailed analysis further shows its feasibility and effectiveness.
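The MAE-style exploitation of unlabeled images that the abstract builds on can be sketched as follows. This is a generic illustration of random patch masking, not the paper's JMAE code; the function name and shapes are hypothetical.

```python
import numpy as np

def random_mask_patches(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of image patches, as in MAE pre-training.

    patches: (num_patches, patch_dim) array of flattened image patches.
    Returns the visible patches plus the kept/masked index sets; the
    encoder would see only the visible patches, and a lightweight
    decoder would be trained to reconstruct the masked ones.
    """
    rng = rng or np.random.default_rng(0)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])
    mask_idx = np.sort(perm[n_keep:])
    return patches[keep_idx], keep_idx, mask_idx

# Toy example: a 16-patch "chest X-ray" with 8-dim patches.
patches = np.arange(16 * 8, dtype=float).reshape(16, 8)
visible, keep_idx, mask_idx = random_mask_patches(patches, mask_ratio=0.75)
print(visible.shape)  # the encoder sees only 25% of the patches
```

Because the reconstruction target is the image itself, this step needs no report annotations, which is why unlabeled chest X-rays can augment training.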