Keywords: vision language, knowledge distillation, vcr, vqa, snli-ve, visual question answering, commonsense reasoning, pretraining, multimodal, robust, low-shot, zero-shot, domain-shift, debiased
TL;DR: Leveraging Pretrained Unimodal Encoders for Vision-Language Tasks via Adaptive Knowledge Distillation
Abstract: Large-scale image-text pairs, such as image-caption and image-phrase data, enable strong representations in vision-language (VL) models. Nevertheless, such pairs lack diversity and complexity due to the constraints of the data collection process. Meanwhile, models pre-trained with image-only or text-only data (which we call unimodal pretrained models) continue to flourish and impress the community. Compared to image-text pairs, unimodal data faces fewer constraints during collection, resulting in more diverse styles. A natural question is: how can we leverage unimodal pretrained models to benefit downstream VL tasks? Most existing works focus on fusing VL information in the expensive pre-training stage: they plug unimodal pre-trained encoders directly into a VL framework and redo an additional pre-training step on paired image-text data. This incurs additional computational expense, and the unimodal pretrained knowledge may be forgotten. In this paper, we take a different route and investigate how to fuse VL information in the finetuning stage only. To directly transfer pretrained knowledge from unimodal models to help downstream VL tasks, we propose $\mathrm{ADVL}$, which avoids redoing any pre-training step and generalizes to be applied on top of various VL models. To comprehensively demonstrate the effectiveness of ADVL, we evaluate on three widely recognized, highly semantic VL benchmarks: VCR, VQA, and SNLI-VE, under three settings: low-shot, full-shot, and domain-shifted. Results show that ADVL consistently improves performance with different VL base models across all settings. It even achieves state-of-the-art (SOTA) performance on VCR among models pre-trained with image-text data and delivers competitive results on VQA and SNLI-VE. Based on our analysis, we also find that ADVL can improve the robustness of VL models and regulate them to make better use of vision information.
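To make the finetuning-stage distillation idea concrete, below is a minimal sketch of a loss that combines downstream task supervision with a distillation term pulling a VL model's features toward those of a frozen unimodal teacher. The abstract does not specify ADVL's exact formulation, so the function name, the KL-based distillation term, and the fixed `alpha` weight here are illustrative assumptions, not the authors' method; an "adaptive" variant could, for example, make `alpha` depend on the sample or training step.

```python
import torch
import torch.nn.functional as F

def distillation_finetune_loss(task_logits, labels,
                               vl_features, unimodal_features,
                               alpha=0.5, temperature=2.0):
    """Hypothetical finetuning objective: task loss + unimodal-teacher distillation.

    All names and the specific loss form are assumptions for illustration;
    they are not taken from the paper.
    """
    # Standard supervised loss on the downstream VL task
    # (e.g., answer classification for VQA/VCR/SNLI-VE).
    task_loss = F.cross_entropy(task_logits, labels)

    # Feature-level distillation: align the VL model's (student) representation
    # with the frozen unimodal encoder's (teacher) representation via a
    # temperature-scaled KL divergence.
    student = F.log_softmax(vl_features / temperature, dim=-1)
    teacher = F.softmax(unimodal_features / temperature, dim=-1)
    distill_loss = F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2

    # alpha trades off task supervision against distillation.
    return (1 - alpha) * task_loss + alpha * distill_loss
```

Because the teacher encoders stay frozen and only a finetuning loss is added, no additional pre-training pass over paired image-text data is required, which is the property the abstract emphasizes.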
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning