Abstract: How do recent vision-language pre-trained models compare with language-specific pre-trained models on common linguistic tasks? In this paper, we assess this question in a probing setting. Our results suggest that different multimodal pre-training strategies yield distinct strengths. Although pre-trained language models generally fare better, pre-trained vision-language models can obtain higher average scores in certain scenarios (e.g., CLIP scores $2\%$ higher than BERT on SST2). We further analyze these performance differences and show that they stem from the different linguistic competences encoded at different model layers. Finally, we propose fine-tuning techniques that improve the performance of vision-language models on linguistic tasks.
Paper Type: short
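As a rough illustration of the probing setting referred to in the abstract (not the authors' exact protocol), the sketch below freezes a pre-trained encoder, extracts sentence representations from a chosen hidden layer, and trains a lightweight linear classifier on top. The model name, pooling choice, and toy data are illustrative assumptions.

```python
# Minimal probing sketch: frozen encoder features + linear classifier.
# Model name, layer, and pooling are assumptions for illustration only.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

def encode(texts, model_name="bert-base-uncased", layer=-1):
    """Return frozen first-token features from the given hidden layer."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**batch).hidden_states[layer]   # (batch, seq_len, dim)
        return hidden[:, 0, :].numpy()                 # [CLS]-style pooling

# Toy sentiment-style probe; a real experiment would use the full SST2 splits.
train_texts = ["a delightful film", "a tedious mess"]
train_labels = [1, 0]
probe = LogisticRegression(max_iter=1000).fit(encode(train_texts), train_labels)
print(probe.predict(encode(["surprisingly moving"])))
```

Probing different values of `layer` is one way to compare where linguistic competence is encoded across models, as the abstract's layer-wise analysis suggests.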