HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding

Published: 20 Jul 2024 · Last Modified: 05 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Visual grounding, which aims to localize a visual region from a natural language expression, is a task that relies heavily on cross-modal alignment. Existing works utilize uni-modal pre-trained models to transfer visual or linguistic knowledge separately while ignoring the corresponding multimodal information. Motivated by recent advances in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task based on multimodal pre-training. However, significant task gaps exist between pre-training and grounding. To address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (HiLoRA) paradigm. The cross-modal bridge resolves the inconsistency between the visual features and those required for grounding, and establishes connections between multi-level visual and textual features. HiLoRA prevents the accumulation of perceptual errors by adapting the cross-modal features from shallow to deep layers in a hierarchical manner. Experimental results on five datasets demonstrate the effectiveness of our approach and showcase significant grounding capabilities as well as promising energy-efficiency advantages. Project page: https://github.com/linhuixiao/HiVG.
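
To make the HiLoRA paradigm above concrete, the following is a minimal PyTorch sketch of one possible shallow-to-deep adaptation schedule. The names (LoRALinear, set_hilora_stage), the rank/alpha defaults, and the per-stage unlocking scheme are illustrative assumptions, not the paper's actual implementation (see the project page for that).

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # A frozen pre-trained linear layer augmented with a trainable
        # low-rank residual update (standard LoRA parameterization).
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # keep pre-trained weights frozen
            self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank
            self.enabled = False  # adapters are unlocked stage by stage

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            out = self.base(x)
            if self.enabled:
                # low-rank residual: x -> A^T -> B^T, scaled
                out = out + self.scale * (x @ self.lora_a.T) @ self.lora_b.T
            return out

    def set_hilora_stage(lora_layers, stage, layers_per_stage):
        # Enable adapters from the shallowest layer down to the depth of the
        # current stage; deeper adapters stay inactive until later stages, so
        # they adapt features that shallower adapters have already corrected.
        depth = (stage + 1) * layers_per_stage
        for i, layer in enumerate(lora_layers):
            active = i < depth
            layer.enabled = active
            layer.lora_a.requires_grad = active
            layer.lora_b.requires_grad = active

In this sketch, wrapping each transformer layer's projection in LoRALinear and advancing the stage at scheduled training steps yields hierarchical adaptation: deep layers never receive adapter gradients before the shallower layers feeding them have been adapted, which is one way to avoid accumulating perceptual errors across depth.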
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: Visual grounding (VG) is a crucial task in the multimodal vision-and-language community: it involves locating, within an image, a specific region described by a linguistic expression. Existing works utilize uni-modal pre-trained models to transfer visual/linguistic knowledge separately while ignoring the corresponding multimodal information. In this paper, we therefore study multimodal transfer learning of vision-language pre-trained models and propose a hierarchical multimodal fine-grained modulation framework, namely HiVG. It is a concise and efficient framework that simultaneously alleviates two kinds of task gaps (i.e., data bias and learning objectives) through a multi-layer adaptive cross-modal bridge and a hierarchical low-rank adaptation (HiLoRA) paradigm. The cross-modal bridge resolves the inconsistency between the visual features and those required for grounding, and establishes connections between multi-level visual and textual features. HiLoRA prevents the accumulation of perceptual errors by adapting the cross-modal features in a hierarchical manner. Experimental results on five datasets demonstrate the effectiveness of our approach. We believe this research is well suited to ACM Multimedia 2024, as it presents a novel idea of hierarchical multimodal fine-grained modulation for the multimodal community and will facilitate advancements in the field of multimedia applications.
Supplementary Material: zip
Submission Number: 2398