Abstract: In this work, we introduce DOPRA, a novel approach to mitigating hallucinations in multi-modal large language models (MLLMs). Unlike existing solutions, which typically require costly supplementary training data or the integration of external knowledge sources, DOPRA addresses hallucinations by applying weighted, layer-specific penalties and redistribution during decoding, offering an economical and effective solution that needs no additional resources. DOPRA is grounded in insights into the intrinsic mechanisms that drive hallucinations in MLLMs, in particular the models' tendency to over-rely on a subset of summary tokens in the self-attention matrix while neglecting critical image-related information. This phenomenon is especially pronounced in certain layers. To counteract this over-reliance, DOPRA applies weighted overlay penalties and redistribution at specific layers, such as the 12th layer, during decoding. In addition, DOPRA includes a retrospective allocation process that re-examines the sequence of generated tokens and reallocates token selection to better align with the actual image content, thereby reducing hallucinatory descriptions in auto-generated captions. Overall, DOPRA represents a significant step toward improving the output quality of MLLMs by systematically reducing hallucinations through targeted adjustments to the decoding process.
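To make the penalize-and-redistribute idea concrete, the following is a minimal toy sketch (not the authors' implementation): at one decoding step, candidates whose attention mass concentrates on summary tokens are penalized, and the scores are then renormalized. The function name, the `alpha` penalty strength, and the per-candidate attention vector are all hypothetical illustrations.

```python
import numpy as np

def penalize_and_redistribute(logits, attn_to_summary, alpha=0.5):
    """Toy sketch of DOPRA-style decoding adjustment (illustrative only).

    logits:          (V,) candidate token scores at the current decode step
    attn_to_summary: (V,) attention mass each candidate places on summary tokens
    alpha:           hypothetical penalty strength
    """
    # Weighted overlay penalty: down-weight candidates that over-attend
    # to summary tokens rather than image-grounded context.
    penalized = logits - alpha * attn_to_summary
    # Redistribution: renormalize the adjusted scores via softmax.
    probs = np.exp(penalized - penalized.max())
    return probs / probs.sum()

# Toy usage: candidate 0 over-attends to summary tokens and loses probability mass.
logits = np.array([2.0, 1.5, 0.5])
attn = np.array([0.9, 0.1, 0.0])
probs = penalize_and_redistribute(logits, attn, alpha=2.0)
```

In this toy example the penalty is large enough to shift the argmax away from the summary-dominated candidate; in practice the penalty would be computed from the self-attention matrix of a specific layer.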
Primary Subject Area: [Generation] Multimedia Foundation Models
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This work is highly relevant to the ACMMM2024 theme, "Multimedia in the Generative AI Era". By proposing the DOPRA approach, this research addresses the problem of perceptual hallucinations generated by multimodal large language models when processing complex inputs. This contribution enables multimedia applications to use generative AI techniques more effectively, advancing the development and application prospects of the field. By integrating language models and multimedia data, we offer new perspectives on the design and application of multimedia foundation models, and provide useful references for further exploration and innovation in multimodal data processing.
Supplementary Material: zip
Submission Number: 2414