UniBoost: Boost Zero-shot Vision-Language Tasks via Multitask Fine-tuning with Unsupervised Unimodal Pre-training

19 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Supplementary Material: pdf
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: vision-language learning, multitask fine-tuning, unsupervised pre-training
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Large-scale joint training of multimodal models, e.g., CLIP, has demonstrated strong performance on many vision-language tasks. However, pre-training on image-text pairs limits coverage of the much wider range of unimodal data, and misaligned pairs introduced during pre-processing add noise. Conversely, unsupervised training of unimodal models on text or image data alone achieves broader coverage of diverse real-world data. This motivates us to build a method on unsupervised pre-trained unimodal models to enhance zero-shot performance on vision-language tasks. Overall, our method is a multitask fine-tuning framework initialized from separate unsupervised pre-trained vision and language encoders, which allows the model to benefit from both the unsupervised pre-training and a variety of supervised data. Experiments show that our method outperforms state-of-the-art CLIP-based models by 6.5\% (52.3\% $\rightarrow$ 58.8\%) on PASCAL-5$^i$ and 6.2\% (27.2\% $\rightarrow$ 33.4\%) on COCO-20$^i$, respectively, under the zero-shot language-guided semantic segmentation setting. By learning representations of both modalities, unimodal pre-training offers strong generalization ability, while multitask fine-tuning shares knowledge across tasks and enhances domain adaptation, resulting in better performance, especially on zero-shot vision-language tasks.
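The setup described in the abstract, i.e., two unimodal encoders initialized from unsupervised (self-supervised) pre-training and then fine-tuned jointly on a mixture of supervised tasks, can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration only: the DINO and BERT checkpoints stand in for whatever unimodal encoders the paper actually uses, and the projection dimension, task heads, and loss functions are placeholders rather than the authors' implementation.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, BertModel, ViTModel

# Both encoders start from *unimodal* self-supervised checkpoints
# (DINO for vision, BERT for text), not from a jointly trained model
# such as CLIP. Checkpoint names are illustrative stand-ins.
vision_encoder = ViTModel.from_pretrained("facebook/dino-vitb16")
text_encoder = BertModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Lightweight projection heads map both modalities into a shared embedding
# space; the dimensionality (512) is an assumption, not taken from the paper.
embed_dim = 512
vision_proj = nn.Linear(vision_encoder.config.hidden_size, embed_dim)
text_proj = nn.Linear(text_encoder.config.hidden_size, embed_dim)


def encode(images, captions):
    """Return L2-normalized image and text embeddings from [CLS] tokens."""
    img_feat = vision_encoder(pixel_values=images).last_hidden_state[:, 0]
    tok = tokenizer(captions, padding=True, return_tensors="pt")
    txt_feat = text_encoder(**tok).last_hidden_state[:, 0]
    img_emb = nn.functional.normalize(vision_proj(img_feat), dim=-1)
    txt_emb = nn.functional.normalize(text_proj(txt_feat), dim=-1)
    return img_emb, txt_emb


# Multitask fine-tuning: sum task-specific losses (e.g., image-text
# alignment, language-guided segmentation) over a mixture of supervised
# datasets. The per-task loss callables are placeholders.
params = (
    list(vision_encoder.parameters())
    + list(text_encoder.parameters())
    + list(vision_proj.parameters())
    + list(text_proj.parameters())
)
optimizer = torch.optim.AdamW(params, lr=1e-5)


def multitask_step(task_batches, task_losses):
    """One fine-tuning step over a dict of {task_name: batch}."""
    optimizer.zero_grad()
    total = sum(task_losses[name](batch) for name, batch in task_batches.items())
    total.backward()
    optimizer.step()
    return total.item()
```

At zero-shot inference time, the shared embedding space produced by `encode` would be used to match image (or pixel-level) features against text prompts for unseen categories; the exact heads and prompting scheme are described in the paper itself, not in this sketch.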
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1768