Keywords: multimodal learning, concept-based learning, foundation models, parameter-efficient fine-tuning, medical imaging, survival analysis
TL;DR: We propose a novel prompt-tuning method that uses clinical concepts to dynamically co-adapt radiology and pathology foundation models, improving multimodal fusion and interpretability.
Abstract: Pretrained medical foundation models (FMs) have shown strong generalization across diverse imaging tasks, such as disease classification in radiology and tumor grading in histopathology. While recent advances in parameter-efficient fine-tuning have enabled effective adaptation of FMs to downstream tasks, these approaches are typically designed for a single modality. In contrast, many clinical workflows rely on joint diagnosis from heterogeneous domains, such as radiology and pathology, where fully leveraging the representation capacity of multiple FMs remains an open challenge. To address this gap, we propose Concept Tuning and Fusing (CTF), a parameter-efficient framework that uses clinically grounded concepts as a shared semantic interface to enable cross-modal co-adaptation before fusion. By incorporating task-specific concepts that are relevant across modalities, CTF aligns radiology and pathology representations, thereby enhancing their complementarity and making predictions interpretable. We further design a Global–Context–Shared Prompt (GCSP) mechanism, which employs a small set of learnable tokens to capture domain-specific priors, shared patient-level information, and cross-domain context. The resulting concept alignment scores from each modality are then fused to produce a final prediction. Extensive experiments demonstrate that CTF outperforms strong unimodal, latent-fusion, and adapter-based baselines (e.g., AUC 0.903 on TCGA-GBMLGG). Notably, CTF achieves these gains without fine-tuning the full FMs, requiring only 0.15% additional parameters, which highlights the effectiveness of concept-based multimodal co-adaptation. Our code is available at: https://github.com/HKU-MedAI/CTF.
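The fusion step described in the abstract (per-modality concept alignment scores combined into one prediction) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes cosine similarity as the alignment score, a plain average as the fusion operator, and a linear head with a sigmoid, none of which the abstract specifies. The function names (`concept_scores`, `fuse_and_predict`) and the toy vectors are hypothetical, and the GCSP prompt tokens are not modeled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def concept_scores(feature, concept_embeddings):
    """Score one modality's feature against every concept embedding.

    Assumption: features and concepts already live in a shared space.
    """
    return [cosine(feature, c) for c in concept_embeddings]

def fuse_and_predict(rad_scores, path_scores, weights, bias=0.0):
    """Average the two score vectors (assumed fusion), then apply a
    linear head followed by a sigmoid to get a probability."""
    fused = [(r + p) / 2.0 for r, p in zip(rad_scores, path_scores)]
    logit = sum(w * s for w, s in zip(weights, fused)) + bias
    prob = 1.0 / (1.0 + math.exp(-logit))
    return fused, prob

# Toy example: two concepts, one feature vector per modality.
concepts = [[1.0, 0.0], [0.0, 1.0]]
rad_feature = [1.0, 0.0]    # stands in for a radiology FM embedding
path_feature = [0.0, 2.0]   # stands in for a pathology FM embedding

rad = concept_scores(rad_feature, concepts)
path = concept_scores(path_feature, concepts)
fused, prob = fuse_and_predict(rad, path, weights=[1.0, -1.0])
```

Because the fused vector has one entry per concept, each weight in the linear head can be read as the contribution of a named concept, which is the sense in which a concept-score interface supports interpretation.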
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6665