KLIP: Keyword-Guided Language-Image Pretraining for Data-Efficient Domain-Specific Image Captioning

21 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Image Captioning, Vision-Language Pretraining
Abstract: Image captioning aims to generate natural language descriptions for a given image. While recent vision-language models have shown promising progress on this task, it remains challenging to fine-tune such models for particular domains with limited image-caption training data. To enable domain-specific few-shot image captioning, we propose a Keyword-Guided Language-Image Pretraining (KLIP) scheme, which learns entity-oriented keywords to align the visual and textual modalities of each data domain during pre-training and fine-tuning. While our pre-training objectives enable this alignment for vision-language models, the identified keywords further serve as prompts for regularizing the model during the fine-tuning stage, so that potential overfitting can be mitigated. Extensive experiments on benchmark datasets show that KLIP performs favorably against state-of-the-art vision-language models combined with various parameter-efficient fine-tuning techniques for domain-specific yet data-efficient image captioning.
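To make the keyword-as-prompt idea more concrete, the sketch below shows one plausible way domain keywords could be turned into a textual prompt prefix for a captioning decoder during fine-tuning. This is an illustration under our own assumptions, not the authors' implementation; `extract_keywords`, `build_prompt`, the domain vocabulary, and the prompt template are all hypothetical placeholders.

```python
# Illustrative sketch only: how entity-oriented domain keywords might be
# prepended as a textual prompt to regularize caption generation during
# fine-tuning. All names here are hypothetical, not the paper's code.

from typing import List


def extract_keywords(caption: str, domain_vocab: set) -> List[str]:
    """Keep entity-oriented tokens that appear in a domain vocabulary."""
    return [tok for tok in caption.lower().split() if tok in domain_vocab]


def build_prompt(keywords: List[str]) -> str:
    """Turn the keywords into a short prompt prefix for the decoder."""
    return "Keywords: " + ", ".join(keywords) + ". Caption:"


# Example usage with a toy medical-imaging vocabulary.
domain_vocab = {"stethoscope", "x-ray", "surgeon"}
caption = "A surgeon examines an x-ray in the operating room"
prompt = build_prompt(extract_keywords(caption, domain_vocab))
# prompt == "Keywords: surgeon, x-ray. Caption:"
# During fine-tuning, such a prefix would condition the decoder on domain
# entities, which is one way keyword prompts could help mitigate
# overfitting on small image-caption sets.
```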
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3480