Keywords: PEFT, PLM, LLM, VLM, Multi-modal, Image captioning
TL;DR: KLAFT aims to generate detailed and comprehensive captions by leveraging fine-grained alignment between PLMs and target domain datasets.
Abstract: We present a visual tuning framework, \textbf{K}nowledge \textbf{L}ift \textbf{A}lignment \textbf{F}ine \textbf{T}uning (KLAFT),
which enhances the expressive image captioning capabilities of Pre-trained Language Models (PLMs), including LLMs and VLMs.
Because expressive captioning requires more detailed and comprehensive captions than basic image description,
the core idea behind KLAFT is that fine-grained alignment can exploit both the capabilities of PLMs and a given target-domain dataset.
This idea motivates us to explore a framework that deeply understands both the given images and the given text, performs this alignment, and tunes PLMs towards expressive image captioning.
To this end, KLAFT modifies the attention mechanism (Modified Attention Mechanism, MAM), introduces a Topic Control Mechanism (TCM), and defines their training objectives.
The innovation of KLAFT lies in how it addresses two disparities in knowledge: visual versus textual via MAM,
and source versus target domain via TCM.
The hidden spaces holding these kinds of knowledge are conceptualized as distinct sub-networks within the PLM, each possessing specific knowledge.
KLAFT's unique contribution is to align and adjust the weights of these sub-networks in a fine-grained manner
and to fine-tune the PLM accordingly.
Our empirical studies demonstrate that KLAFT significantly improves performance on expressive captioning tasks by aligning and amplifying target knowledge, with the potential for Parameter-Efficient Fine-Tuning (PEFT) at low computational cost.
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1827