The Effect of Data Corruption on Multimodal Long Form Responses

Published: 03 Jul 2024 · Last Modified: 15 Jul 2024 · ICML 2024 FM-Wild Workshop Poster · CC BY 4.0
Keywords: Vision language models, Data corruption, Hallucinations
TL;DR: We investigate the effect of training data corruption on vision-language model performance and hallucinations.
Abstract: Despite significant progress, Vision-Language Models (VLMs) still struggle with hallucinations, especially in long-form responses. Existing mitigation strategies have had limited success in specific cases, and long-form generation remains problematic. In this work, we attempt to establish the link between the data used to train a model and the hallucinations in its output. To this end, we examine hallucinations through deliberate data corruption: we develop a method to corrupt training data and then train models on this data to measure the effect on performance. We show that corrupting only a small portion of the long-form training data significantly impairs the model's performance on long-form tasks, while leaving simpler tasks such as visual question answering and multiple choice relatively intact. All training code and models are released for reproducibility and future research.
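The abstract does not specify the corruption procedure, but a minimal sketch of one plausible scheme is shown below: select a fraction of long-form examples (identified here by a response-length threshold) and swap their responses with responses from other examples, so the text no longer matches the paired image. All names (`corrupt_long_form`, the example fields, the word-count threshold) and the swapping scheme itself are illustrative assumptions, not the paper's confirmed method.

```python
import random

def corrupt_long_form(examples, fraction=0.1, long_form_min_words=60, seed=0):
    """Corrupt a fraction of long-form training examples by swapping
    their responses with responses drawn from other examples.

    `examples` is a list of dicts with "image", "instruction", and
    "response" keys. This is a hypothetical corruption scheme; the
    paper's exact procedure may differ.
    """
    rng = random.Random(seed)
    # Indices of long-form examples: responses above a word-count threshold.
    long_idx = [i for i, ex in enumerate(examples)
                if len(ex["response"].split()) >= long_form_min_words]
    n_corrupt = int(fraction * len(long_idx))
    targets = rng.sample(long_idx, n_corrupt)
    pool = [examples[i]["response"] for i in long_idx]
    corrupted = [dict(ex) for ex in examples]  # shallow copies
    for i in targets:
        # Replace the response with one drawn from a different example,
        # breaking the image-text correspondence for this sample.
        swap = rng.choice([r for r in pool if r != examples[i]["response"]])
        corrupted[i]["response"] = swap
    return corrupted
```

Models would then be retrained on `corrupted` and compared against a clean-data baseline on long-form and short-form benchmarks.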
Submission Number: 117