KLIP: Keyword-Guided Language-Image Pretraining for Data-Efficient Domain-Specific Image Captioning

21 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Image Captioning, Vision-Language Pretraining
Abstract: Image captioning aims to generate natural language descriptions for a given image. While recent vision-language models have shown promising progress on this task, it remains challenging to fine-tune such models for particular domains with limited image-caption training data. To enable domain-specific few-shot image captioning, we propose a Keyword-Guided Language-Image Pretraining (KLIP) scheme, which learns entity-oriented keywords to align the visual and textual modalities of each data domain during pre-training and fine-tuning. While our pre-training objectives enable this alignment for vision-language models, the identified keywords further serve as prompts for regularizing the model during the fine-tuning stage, so that potential overfitting can be mitigated. Extensive experiments on benchmark datasets show that KLIP performs favorably against state-of-the-art vision-language models combined with various parameter-efficient fine-tuning techniques for domain-specific yet data-efficient image captioning.
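To make the keyword-as-prompt idea more concrete, the sketch below shows one plausible way domain keywords could be turned into a textual prompt prefix for a captioning decoder during fine-tuning. This is an illustration under our own assumptions, not the authors' implementation; `extract_keywords`, `build_prompt`, the domain vocabulary, and the prompt template are all hypothetical placeholders.

```python
# Illustrative sketch only: how entity-oriented domain keywords might be
# prepended as a textual prompt to regularize caption generation during
# fine-tuning. All names here are hypothetical, not the paper's code.

from typing import List


def extract_keywords(caption: str, domain_vocab: set) -> List[str]:
    """Keep entity-oriented tokens that appear in a domain vocabulary."""
    return [tok for tok in caption.lower().split() if tok in domain_vocab]


def build_prompt(keywords: List[str]) -> str:
    """Turn the keywords into a short prompt prefix for the decoder."""
    return "Keywords: " + ", ".join(keywords) + ". Caption:"


# Example usage with a toy medical-imaging vocabulary.
domain_vocab = {"stethoscope", "x-ray", "surgeon"}
caption = "A surgeon examines an x-ray in the operating room"
prompt = build_prompt(extract_keywords(caption, domain_vocab))
# prompt == "Keywords: surgeon, x-ray. Caption:"
# During fine-tuning, such a prefix would condition the decoder on domain
# entities, which is one way keyword prompts could help mitigate
# overfitting on small image-caption sets.
```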
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3480