Guide PLMs Towards Expressive Captioning via Knowledge Alignment and Lift

16 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission
Keywords: Vision Language Models, Large Language Models, Pretrained Language Models, Image Captioning, Knowledge Control, Domain gap
TL;DR: Our model-agnostic framework guides PLMs to generate fine-grained texts from given images via knowledge alignment and lift.
Abstract: Image captioning studies in Vision-Language Pretraining (VLP) have used pre-trained (vision) language models (PLMs) and demonstrated competitive performance; our goal is to guide PLMs towards fine-grained captioning. To this end, while addressing their shortcomings, we focus on the knowledge gap, both between visual and semantic knowledge and between the source and target domains. Unlike prior work, our approach is based on the idea that knowledge is projected into the hidden spaces of PLMs, and that this gap can be closed by fine-grained knowledge alignment and lift in these spaces. This idea motivates us to propose a model-agnostic framework, \textbf{K}nowledge \textbf{E}mpowered \textbf{I}mage \textbf{C}aptioning (KEIC), rather than a specific model; KEIC consists of a Topic Control Mechanism (TCM), a Mapping Layer (MaL), a Modified Attention Mechanism (MAM), and the training tasks MRM, TTM, and ITM. Experiments show that KEIC outperforms baselines on captioning tasks, and that its newly introduced components contribute to expressive caption generation via knowledge alignment and lift.
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 492