Leave the Bias in Bias: Mitigating the Label Noise Effects in Continual Visual Instruction Fine-Tuning

Published: 2025 · Last Modified: 09 Jan 2026 · ICME 2025 · CC BY-SA 4.0
Abstract: In recent years, multimodal large language models (MLLMs) with vision processing capability have shown substantial advancements, excelling particularly in interpreting general images. Their application to domain-specific tasks, such as those in the medical field, is further enhanced through continual visual instruction fine-tuning (CVIF). Despite these advancements, a significant challenge arises from label noise encountered during the collection of domain-specific data. Our studies reveal that this label noise can degrade the learning of vision projection embeddings and introduce inaccuracies during LLM fine-tuning, often leading to hallucinations. In this paper, we introduce a novel framework designed to minimize the impact of label noise. Our approach stabilizes the learning of vision embeddings and reduces the effect of label noise by leveraging LLMs' inherent semantic understanding of uncertainty. Extensive experiments demonstrate that our framework maintains robust performance on general visual question answering (VQA) tasks while showing significant effectiveness on medical VQA tasks. To the best of our knowledge, this is the first study to specifically address and analyze the impact of label noise in CVIF.