RAD3D-Prefix: Anomaly-Aware Prefix Learning on Frozen LLM for 3D CT Image to Report Generation

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Report generation, 3D CT scans, medical image analysis
Abstract: Recent advances in multimodal learning, including large language models (LLMs) and vision-language models (VLMs, also known as foundation models), have demonstrated strong adaptability to natural images. However, extending their use to the medical domain, particularly for volumetric (3D) images, is challenging due to high computational complexity and the need to model volumetric dependencies. The significant misalignment between visual and textual features further limits the ability to leverage the strengths of LLMs, and naively fine-tuning these models on limited medical data often leads to overfitting and underperformance on downstream tasks. In this study, we address these challenges for report generation from volumetric radiology scans (specifically CT) by introducing a simple, lightweight approach that minimizes the need for extensive parameter training. Our solution, called RAD3D-Prefix, employs a novel anomaly-aware prefix learning module that effectively aligns visual features from 3D images with textual features. This module integrates image embeddings with multi-label diagnostic classification logits, preserving critical clinical details while bridging the vision-language gap. By keeping the LLM frozen, our method requires minimal trainable parameters and mitigates the risk of overfitting on small, domain-specific datasets. Across four evaluation criteria, RAD3D-Prefix outperforms existing similar-sized models and performs comparably to larger models with more than five times as many trainable parameters. Our approach demonstrates superior clinical relevance and out-of-domain generalization, highlighting the effectiveness of our lightweight, anomaly-aware prefix projection module.
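The abstract describes fusing pooled 3D image embeddings with multi-label diagnostic logits into prefix tokens that are prepended to a frozen LLM. The paper's actual architecture is not shown on this page, so the following is only a minimal PyTorch sketch of that idea; all module names, dimensions, and the concatenate-then-project fusion strategy are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AnomalyAwarePrefix(nn.Module):
    """Hypothetical sketch of an anomaly-aware prefix module: it fuses a
    pooled 3D CT embedding with multi-label classification logits and
    projects the result into a sequence of prefix tokens for a frozen LLM."""

    def __init__(self, vis_dim: int, num_labels: int, llm_dim: int, prefix_len: int):
        super().__init__()
        self.prefix_len = prefix_len
        self.llm_dim = llm_dim
        # Single linear projection from fused features to all prefix tokens
        # (an assumption; the paper may use a deeper projection).
        self.proj = nn.Linear(vis_dim + num_labels, llm_dim * prefix_len)

    def forward(self, vis_emb: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
        # vis_emb: (B, vis_dim) pooled features from a 3D image encoder
        # logits:  (B, num_labels) multi-label diagnostic logits
        # Squash logits to [0, 1] so anomaly probabilities condition the prefix.
        fused = torch.cat([vis_emb, torch.sigmoid(logits)], dim=-1)
        # Reshape into (B, prefix_len, llm_dim) prefix tokens, which would be
        # prepended to the LLM's input embeddings while the LLM stays frozen.
        return self.proj(fused).view(-1, self.prefix_len, self.llm_dim)


if __name__ == "__main__":
    module = AnomalyAwarePrefix(vis_dim=512, num_labels=18, llm_dim=768, prefix_len=8)
    prefix = module(torch.randn(2, 512), torch.randn(2, 18))
    print(prefix.shape)  # torch.Size([2, 8, 768])
```

Only the prefix module's parameters would be trained in this setup, which matches the abstract's claim of minimal trainable parameters relative to full fine-tuning.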
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 22782