Keywords: Multimodal Learning, Compositional Reasoning, Multimodal Large Language Models, Visual Representation Learning
Abstract: Multimodal Large Language Models (MLLMs) employ contrastively pre-trained Vision Encoders whose performance falls short in compositional understanding and visual reasoning. This is largely due to their pre-training objective, which targets retrieval between similar images and captions rather than an in-depth understanding of every component of an image. Moreover, while state-of-the-art image encoding methods yield strong performance, they inflate the number of visual input tokens by roughly two to three times, significantly lengthening both training and inference. To alleviate these issues, we present **OG-LLaVA** (**O**bject-**G**uided **LLaVA**), a novel multimodal architecture that, through an innovative connector design, ***OG-Fusion***, enhances the model's ability to understand and reason about visual content *without* substantially increasing the number of tokens or unfreezing the Vision Encoder. A core element of ***OG-Fusion*** is the combination of CLIP representations with segmentation outputs. By leveraging the descriptive power of advanced segmentation models, **OG-LLaVA** attains superior performance on tasks that require a deeper understanding of object relationships and spatial arrangements, within the domains of compositional reasoning and visual grounding.
Submission Number: 34
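The abstract describes the connector only at a high level. Below is a minimal, hypothetical sketch of how a connector *could* fuse frozen CLIP patch features with a per-patch object map produced by a segmentation model while keeping the visual token count fixed; it is not the paper's implementation. The module name `OGFusionSketch`, the embedding-based fusion, the LLaVA-style MLP projector, and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn


class OGFusionSketch(nn.Module):
    """Hypothetical connector sketch: fuses CLIP patch features with a per-patch
    object-ID map from a segmentation model, keeping the token count unchanged."""

    def __init__(self, clip_dim=1024, llm_dim=4096, max_objects=64):
        super().__init__()
        # Learned embedding per segmentation-assigned object slot (assumption).
        self.object_embed = nn.Embedding(max_objects, clip_dim)
        # LLaVA-style two-layer MLP projector into the LLM embedding space.
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, clip_patches, patch_object_ids):
        # clip_patches:     [B, N, clip_dim]  frozen CLIP patch features
        # patch_object_ids: [B, N]            object index per patch from the segmenter
        fused = clip_patches + self.object_embed(patch_object_ids)
        return self.proj(fused)               # [B, N, llm_dim] -- still N visual tokens


# Usage with dummy tensors (shapes follow CLIP ViT-L/14 at 336 px: 576 patches).
connector = OGFusionSketch()
patches = torch.randn(2, 576, 1024)
object_ids = torch.randint(0, 64, (2, 576))
visual_tokens = connector(patches, object_ids)  # torch.Size([2, 576, 4096])
```

Because the segmentation information is injected additively into the existing patch tokens rather than appended as extra tokens, a design of this kind avoids the two-to-three-fold token inflation mentioned in the abstract while leaving the Vision Encoder frozen.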