Abstract: Recent advancements in Medical Vision-Language Models (VLMs) have significantly improved medical cross-modal task performance through large-scale contrastive pre-training. However, deploying these large models in clinical settings is hindered by their computational complexity and vulnerability to adversarial attacks. While knowledge distillation offers a solution by transferring knowledge to efficient student models, traditional methods typically ignore robustness, leaving distilled models susceptible to adversarial attacks. To address these challenges, we propose a novel Dynamic Gradient and Hierarchical Feature Alignment framework (DGHFA) for robust knowledge distillation. Our approach introduces a dynamic gradient calibration mechanism for balanced knowledge transfer and a hierarchical adversarial feature alignment framework to enhance robustness under adversarial attacks. Extensive experiments on two medical VLMs and downstream pathology and X-ray datasets demonstrate that our method outperforms state-of-the-art approaches across multiple attack scenarios, achieving improvements of 2.3 and 1.7 percentage points in robust accuracy, respectively.
External IDs:dblp:conf/miccai/XiaoWZZWWZ25