Abstract: In the realm of Medical Visual Language Models (Med-VLMs), the quest for universal, efficient fine-tuning mechanisms remains paramount yet largely unexplored, especially given that researchers in interdisciplinary fields often face severe shortages of training resources.
Most current Parameter-Efficient Fine-Tuning (PEFT) methods have not been comprehensively evaluated on Med-VLMs, and most of them focus on adding components to the model's structure or input. However, fine-tuning intrinsic model components often yields better generality and consistency, and its impact on the ultimate performance of Med-VLMs has been widely overlooked and remains understudied. In this paper, we explore an alternative to traditional PEFT methods, focusing on the impact of fine-tuning the LayerNorm and Attention layers of Med-VLMs. Our comprehensive study spans both small-scale and large-scale Med-VLMs, evaluating their performance under various fine-tuning paradigms on tasks such as Medical Visual Question Answering and Medical Imaging Report Generation. The findings reveal that fine-tuning solely the LayerNorm layers not only surpasses the efficiency of traditional PEFT methods but also preserves the model's accuracy and generalization capabilities across a spectrum of medical downstream tasks. The experiments demonstrate the superior adaptability and scalability of LayerNorm fine-tuning, particularly in the context of large-scale Med-VLMs. We hope this work will contribute to the ongoing discourse on optimizing efficient fine-tuning strategies for medical VLMs.
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: This paper is highly relevant to the conference as it presents cutting-edge research on Medical Visual Language Models (Med-VLMs), merging multimedia and artificial intelligence technologies. Medical image and language data typically constitute multimedia content, necessitating the integration of visual and textual information. The paper explores parameter-efficient fine-tuning methods for Med-VLMs, aiming to tackle challenges unique to the medical domain such as limited data and domain-specific requirements. This aligns closely with the themes of multimedia conferences, given the significance of integrating multimedia data processing with AI algorithms, particularly in medicine. Additionally, the paper proposes a novel fine-tuning strategy to optimize the performance of Med-VLMs, which is crucial for researchers in multimedia fields facing scarce training resources. Hence, presenting this paper at a multimedia conference would offer valuable insights and methodologies for researchers addressing key issues in the medical domain.
Supplementary Material: zip
Submission Number: 571