Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning

20 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Multi-Label Image Recognition, Text to Image, Parameter-Efficient Fine-Tuning
Abstract: Benefiting from image-text contrastive learning, pre-trained vision-language models such as CLIP make it possible to directly leverage texts as images (TaI) for parameter-efficient fine-tuning (PEFT). Although CLIP aligns image features with their corresponding text features, the modality gap remains a nontrivial issue and limits the performance of TaI on multi-label image recognition (MLR). Taking MLR as an example, we present a novel method, called T2I-PAL, to tackle the modality gap when using only text captions for PEFT. The core idea of T2I-PAL is to leverage pre-trained text-to-image generation models to synthesize photo-realistic and diverse images from text captions, thereby reducing the modality gap. For better PEFT, we further combine prompt tuning and adapter learning to enhance classification performance. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that T2I-PAL boosts recognition performance by 3.47% on average over the top-ranked state-of-the-art methods. Our code and models will be made publicly available.
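To make the joint prompt-adapter idea concrete, here is a minimal, hypothetical sketch (not the authors' code) of the kind of CLIP-style multi-label head the abstract describes: learned per-class prompt embeddings on the text side, plus a lightweight residual adapter on the image feature, scored by cosine similarity. All names, shapes, and the blending ratio are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 512, 80          # feature dim, number of labels (MS-COCO has 80 classes)

# Learned class prompts: one text-side embedding per label. In practice these
# would come from prompt tuning on top of a frozen CLIP text encoder.
class_prompts = rng.standard_normal((C, D))

# Lightweight residual adapter on the image feature: down-project, ReLU,
# up-project, then blend with the original feature via a ratio alpha.
W_down = rng.standard_normal((D, D // 4)) * 0.02
W_up = rng.standard_normal((D // 4, D)) * 0.02
alpha = 0.2

def adapt(feat):
    residual = np.maximum(feat @ W_down, 0.0) @ W_up
    return alpha * residual + (1 - alpha) * feat

def multilabel_scores(image_feat):
    f = adapt(image_feat)
    f = f / np.linalg.norm(f)                                    # CLIP-style
    p = class_prompts / np.linalg.norm(class_prompts, axis=1,    # cosine
                                       keepdims=True)            # similarity
    return p @ f                                                 # one score per label

image_feat = rng.standard_normal(D)   # stands in for a CLIP image embedding
scores = multilabel_scores(image_feat)
print(scores.shape)  # (80,)
```

In the full method, the image features would come from generated images (text-to-image synthesis from the captions), and only the prompts and adapter weights would be trained while the CLIP backbone stays frozen.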
Supplementary Material: pdf
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2599