Keywords: OOD Generalization · Medical VLMs · Distribution Shifts
Abstract: Medical vision-language models (VLMs) offer promise for
clinical decision support, yet their reliability under distribution shifts remains
a major concern for safe deployment. These models often learn
task-agnostic correlations due to variability in imaging protocols and
free-text reports, limiting their generalizability and increasing the risk
of failure in real-world settings. We propose DRiFt, a structured feature
decoupling framework that explicitly separates clinically relevant signals
from task-agnostic noise using parameter-efficient tuning (LoRA)
and learnable prompt tokens. To enhance cross-modal alignment and
reduce uncertainty, we curate high-quality, clinically grounded image-text
pairs by generating captions for a diverse medical dataset. Our
approach improves in-distribution performance by +11.4% Top-1 accuracy
and +3.3% Macro-F1 over prior prompt-based methods, while
maintaining strong robustness across unseen datasets. Ablation studies
reveal that disentangling task-relevant features and careful alignment
significantly enhance model generalization and reduce unpredictable behavior
under domain shift. These insights contribute toward building
safer, more trustworthy VLMs for clinical use. The code is available at
https://github.com/rumaima/DRiFt.
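To make the two parameter-efficient ingredients the abstract names more concrete, below is a minimal PyTorch sketch of (a) a LoRA adapter wrapping a frozen linear projection and (b) learnable prompt tokens prepended to text token embeddings. All class names, dimensions, and the toy transformer encoder here are illustrative assumptions, not the released DRiFt implementation; see the repository above for the authors' code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        # Low-rank factors: B starts at zero so training begins at the base model.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base projection plus the scaled low-rank residual update.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

class PromptedTextEncoder(nn.Module):
    """Prepends learnable prompt tokens to token embeddings before encoding.

    The 2-layer transformer stands in for a real VLM text tower; it is an
    assumption made only to keep the sketch self-contained and runnable.
    """
    def __init__(self, embed_dim: int = 512, n_prompts: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        batch = token_embeds.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        x = torch.cat([prompts, token_embeds], dim=1)  # prompts lead the sequence
        return self.encoder(x).mean(dim=1)  # pooled text feature

if __name__ == "__main__":
    # Toy forward pass: 2 captions of 16 tokens with 512-dim embeddings.
    lora = LoRALinear(nn.Linear(512, 512), rank=4)
    text_enc = PromptedTextEncoder()
    feats = text_enc(torch.randn(2, 16, 512))
    print(lora(feats).shape)  # torch.Size([2, 512])
```

In this setup only the LoRA factors and the prompt tokens receive gradients, which is the property that makes such tuning parameter-efficient; how DRiFt uses these components to separate clinically relevant signal from task-agnostic noise is described in the paper itself.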
Submission Number: 12