CAT-LLM: Context-Aware Training enhanced Large Language Models for multi-modal contextual image retrieval

21 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission
Keywords: LLM, Composed image retrieval, multi-modal contextual image retrieval, context-aware training
TL;DR: CAT-LLM: Context-Aware Training enhanced Large Language Models for multi-modal contextual image retrieval
Abstract: Recently, the unprecedented advancement of Large Language Models (LLMs) has revolutionized numerous applications in the vision-language domain. Inspired by their extraordinary visual understanding and logical reasoning abilities, we propose a method that employs LLMs to address the Multi-Modal Contextual Image Retrieval (MMCIR) problem, where the query includes both visual and textual components. Specifically, given a query comprising a sequence of images and texts, MMCIR aims to select the image from a gallery that best matches the context of the query. In this paper, we first construct a Multi-Modal Captioning (MMC) dataset by enriching existing image captioning datasets from ⟨image, caption⟩ pairs to ⟨reference image, reference caption, text condition, target caption⟩ quadruples. Then, we introduce a Context-Aware Captioning (CA-Cap) objective and a Context-Aware Text Matching (CA-TM) objective to instruct a frozen LLM for MMCIR. These specialized objectives enable the LLM to better understand multi-modal inputs and output visual representations from complex multi-modal contexts. Comprehensive experiments demonstrate the effectiveness of our method on recent Zero-Shot Composed Image Retrieval (ZS-CIR) benchmarks (i.e., CIRCO, CIRR, and GeneCIS), as well as in complex scenarios with dense multi-modal inputs such as Visual Storytelling and Visual Dialog.
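The ⟨reference image, reference caption, text condition, target caption⟩ quadruple described in the abstract can be sketched as a simple data structure. This is a minimal illustrative sketch: the class, field names, and sample values are hypothetical, not the authors' actual dataset code.

```python
from dataclasses import dataclass

@dataclass
class MMCSample:
    """One enriched sample in the Multi-Modal Captioning (MMC) dataset,
    following the <reference image, reference caption, text condition,
    target caption> quadruple described in the abstract.
    All field names and values here are illustrative."""
    reference_image: str    # path or ID of the reference image
    reference_caption: str  # caption describing the reference image
    text_condition: str     # text stating how the target differs from the reference
    target_caption: str     # caption of the desired target image

# Hypothetical example: the text condition relates the reference to the target,
# which is the kind of supervision the CA-Cap / CA-TM objectives would use.
sample = MMCSample(
    reference_image="images/000123.jpg",
    reference_caption="a brown dog running on the grass",
    text_condition="the dog is sitting instead of running",
    target_caption="a brown dog sitting on the grass",
)
print(sample.text_condition)
```

In this framing, a plain ⟨image, caption⟩ pair from an existing captioning dataset supplies the first two fields, and the enrichment step adds the text condition and target caption that turn it into a retrieval-style training example.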
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3122