Keywords: Large Multi-modal Models, Object Hallucination, Context Hijacking
TL;DR: Irrelevant visual or textual contexts mislead Large Multi-modal Models into producing biased output.
Abstract: Recently, Large Multi-modal Models (LMMs) have demonstrated their ability to understand the visual content of images given instructions about those images. Built upon Large Language Models (LLMs), LMMs also inherit their abilities and characteristics, such as in-context learning, where a coherent sequence of images and texts is given as the input prompt. However, we identify a new limitation of off-the-shelf LMMs: a small fraction of incoherent images or text descriptions misleads LMMs into generating biased output about the hijacked context rather than the originally intended one. To address this, we investigate whether replacing the hijacked visual and textual contexts with correlated ones, via GPT-4V and text-to-image models, can help yield coherent responses.
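Below is a minimal, hypothetical sketch (not the authors' code or evaluation setup) of the scenario the abstract describes: an interleaved image-text in-context prompt in which one incoherent ("hijacked") example is present, and a repair step that swaps it for a correlated caption-image pair. The image paths, the keyword-based detection, and the `generate_correlated_image` helper are all placeholder assumptions standing in for GPT-4V-based filtering and a text-to-image model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContextItem:
    image_path: str   # path or URL of the image shown to the LMM
    caption: str      # text paired with the image in the prompt


def build_prompt(context: List[ContextItem], query_image: str) -> str:
    """Flatten an interleaved image-text context plus a query into one prompt string."""
    parts = [f"<image:{item.image_path}> {item.caption}" for item in context]
    parts.append(f"<image:{query_image}> Describe this image.")
    return "\n".join(parts)


def generate_correlated_image(caption: str) -> str:
    """Placeholder for a text-to-image model that would render an image matching `caption`."""
    return f"generated/{caption.replace(' ', '_')}.png"


def repair_context(context: List[ContextItem], topic_keywords: List[str]) -> List[ContextItem]:
    """Replace items whose captions share no keyword with the intended topic.
    This keyword check is a toy stand-in for detecting the hijacked context
    with a stronger model such as GPT-4V."""
    repaired = []
    for item in context:
        if any(kw in item.caption.lower() for kw in topic_keywords):
            repaired.append(item)
        else:
            fixed_caption = f"A photo related to {topic_keywords[0]}."
            repaired.append(ContextItem(generate_correlated_image(fixed_caption), fixed_caption))
    return repaired


if __name__ == "__main__":
    # Coherent context about dogs, with one hijacked (incoherent) item about a car.
    context = [
        ContextItem("imgs/dog1.jpg", "A dog playing fetch in the park."),
        ContextItem("imgs/car.jpg", "A red sports car on a highway."),  # hijacked context
        ContextItem("imgs/dog2.jpg", "A dog sleeping on a couch."),
    ]
    print("Hijacked prompt:\n", build_prompt(context, "imgs/dog_query.jpg"), "\n")
    print("Repaired prompt:\n", build_prompt(repair_context(context, ["dog"]), "imgs/dog_query.jpg"))
```

In this toy setup, the repaired prompt keeps the coherent dog examples and substitutes the car example with a correlated dog caption and a generated image path, mirroring the replacement strategy the abstract investigates.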
Submission Number: 40