Leave the Bias in Bias: Mitigating the Label Noise Effects in Continual Visual Instruction Fine-Tuning

ACL ARR 2024 June Submission 633 Authors

12 Jun 2024 (modified: 02 Jul 2024), ACL ARR 2024 June Submission, CC BY 4.0
Abstract: In recent years, multimodal large language models (MLLMs) with vision processing capabilities have shown substantial advancements, excelling particularly at interpreting general images. Their application to domain-specific tasks, such as those in the medical field, is further enhanced through continual visual instruction fine-tuning (CVIF). Despite these advancements, a significant challenge arises from label noise encountered during the collection of domain-specific data. Our studies reveal that this label noise can adversely affect the learning of vision projection embeddings and contribute to inaccuracies in the fine-tuning of LLMs, often leading to hallucinations. In this paper, we introduce a novel framework designed to minimize the impact of label noise. Our approach focuses on stabilizing the learning of vision embeddings and reducing the effect of label noise by exploiting LLMs' inherent semantic understanding of uncertainty. Extensive experiments demonstrate that our framework maintains robust performance on general visual question answering (VQA) tasks while showing significant effectiveness on medical VQA tasks. To the best of our knowledge, this is the first study to specifically address and analyze the impact of label noise in CVIF.
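The abstract describes the approach only at a high level, so the sketch below is purely illustrative and is not the authors' method. It assumes a hypothetical LLaVA-style setup in which a vision projector feeds image embeddings into an LLM, and shows two generic mechanisms consistent with the abstract's two claims: (1) stabilizing the vision projector with an exponential moving average so noisy gradients do not destabilize vision embeddings, and (2) down-weighting possibly mislabeled examples by the model's own token-level predictive entropy as a stand-in for "semantic understanding of uncertainty". All names (`uncertainty_weighted_loss`, `ema_update`, `projector`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_loss(logits, labels, ignore_index=-100):
    """Per-example cross-entropy, down-weighted by normalized predictive
    entropy so high-uncertainty (possibly mislabeled) answers contribute
    less to the gradient. Illustrative only, not the paper's method."""
    # logits: (B, T, V); labels: (B, T), with ignore_index on prompt tokens.
    ce = F.cross_entropy(
        logits.transpose(1, 2), labels,
        ignore_index=ignore_index, reduction="none")            # (B, T)
    mask = (labels != ignore_index).float()
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)   # (B, T)
    entropy = entropy / torch.log(torch.tensor(float(logits.size(-1))))
    # Per-example confidence weight in [0, 1]: 1 - mean token entropy.
    weight = 1.0 - (entropy * mask).sum(1) / mask.sum(1).clamp_min(1.0)
    per_example = (ce * mask).sum(1) / mask.sum(1).clamp_min(1.0)
    return (weight.detach() * per_example).mean()

@torch.no_grad()
def ema_update(stable_projector, projector, decay=0.999):
    """Keep an exponential-moving-average copy of the vision projector,
    one generic way to stabilize vision embeddings under label noise."""
    for p_s, p in zip(stable_projector.parameters(),
                      projector.parameters()):
        p_s.mul_(decay).add_(p, alpha=1.0 - decay)
```

In such a hypothetical pipeline, `uncertainty_weighted_loss` would replace the standard token-level cross-entropy during CVIF, and `ema_update` would be called after each optimizer step; whether the paper's framework resembles either mechanism cannot be determined from the abstract alone.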
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, robustness
Contribution Types: NLP engineering experiment
Languages Studied: natural language, English
Submission Number: 633