Efficient Self-Supervised Adaptation of 3D Abdominal Vision-Language Model for Institution-Specific HCC Classification via Full Fine-Tuning and PEFT

03 Dec 2025 (modified: 15 Dec 2025) · MIDL 2026 Conference Submission · CC BY 4.0
Keywords: Vision-Language Model, Self-Supervised Learning, Contrastive Learning, Domain Adaptation, Parameter-Efficient Fine Tuning, Classification
TL;DR: We demonstrate that a 3D medical VLM can be efficiently adapted for institution-specific HCC classification using only CT-report pairs on a single GPU, achieving performance comparable to supervised methods from the literature without lesion annotations or multi-GPU training.
Abstract: Medical vision-language models (VLMs) have demonstrated strong capability in capturing cross-modal relationships between images and text, yet their adaptation to institution-specific clinical tasks remains underexplored. In this study, we fine-tuned a pretrained 3D medical VLM for hepatocellular carcinoma (HCC) classification using paired abdominal CT scans and radiology reports from a different institution, with acquisition characteristics that differ from the model’s original pretraining corpus. We compared two adaptation strategies: full fine-tuning and parameter-efficient fine-tuning (PEFT), motivated by the common use of PEFT to reduce computational cost and enable adaptation under limited-data constraints. Both approaches achieve strong downstream HCC classification performance despite the cross-institutional domain shift, with PEFT reaching an AUC of 0.94 and F1 of 0.91, and full fine-tuning achieving an AUC of 0.95 and F1 of 0.90. These results are competitive with, and in some settings exceed, previously reported supervised HCC classification approaches that rely on lesion-level annotation or segmentation. Full fine-tuning converges rapidly but overfits within a few epochs, whereas PEFT (ConvLoRA for the image encoder and LoRA for the text encoder) attains comparable performance while updating only ∼1% of the model parameters, although it requires more training steps. To better understand adaptation behavior, we also examine the role of the contrastive temperature, observing that temperature initialization significantly affects classification performance. This study demonstrates that a 3D medical VLM can be efficiently adapted to institution-specific HCC classification using self-supervised CT-report contrastive learning, while highlighting the practical trade-offs between full fine-tuning and parameter-efficient fine-tuning.
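The abstract's observation about contrastive temperature can be made concrete with a minimal, pure-Python sketch of a CLIP-style symmetric contrastive loss over paired image/report embeddings, where the temperature scales the similarity logits before the softmax. This is an illustrative sketch under stated assumptions (L2-normalized embeddings, a log-parameterized temperature as in CLIP); the function names are hypothetical and this is not the paper's actual implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def clip_contrastive_loss(img_emb, txt_emb, log_temp):
    """Symmetric CLIP-style loss over L2-normalized embeddings.

    img_emb[i] and txt_emb[i] are a matched CT/report pair.
    log_temp is a (learnable) log-temperature; logits = cosine_sim / exp(log_temp),
    so a smaller temperature sharpens the softmax over candidate pairs.
    """
    temp = math.exp(log_temp)
    n = len(img_emb)
    # Cosine similarities (embeddings assumed already normalized), scaled by 1/temp.
    sims = [[sum(a * b for a, b in zip(img_emb[i], txt_emb[j])) / temp
             for j in range(n)] for i in range(n)]
    # Image -> text cross-entropy: matched pairs sit on the diagonal.
    loss_i2t = -sum(math.log(softmax(sims[i])[i]) for i in range(n)) / n
    # Text -> image: same logits, transposed.
    sims_t = [[sims[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(math.log(softmax(sims_t[j])[j]) for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

Because the temperature multiplies every logit, its initialization changes the effective gradient scale of the contrastive objective from the first step, which is one plausible mechanism for the sensitivity the paper reports.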
Primary Subject Area: Transfer Learning and Domain Adaptation
Secondary Subject Area: Learning with Noisy Labels and Limited Data
Registration Requirement: Yes
Reproducibility: https://github.com/fbrynpk/HCC-Merlin
Visa & Travel: No
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 300