Keywords: domain adaptation; vision transformers
TL;DR: ViT-based models substantially improve performance on unsupervised domain adaptation without specialised finetuning.
Abstract: Vision transformer-based foundation models, such as ViT or Dino-V2, are aimed at solving problems with little or no finetuning of features. Using a prototypical-network setting, we analyse to what extent such foundation models can solve unsupervised domain adaptation without finetuning on the source or target domain. Through quantitative analysis, as well as qualitative interpretations of the decision making, we demonstrate that the suggested method improves upon existing baselines, and we highlight the limitations of this approach that remain to be addressed. The code is available at: https://github.com/lira-centre/vit_uda/
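To make the described setting concrete, the sketch below shows nearest-prototype classification over frozen foundation-model features. It is an illustrative assumption of how such a pipeline could look, not the authors' implementation (see the linked repository for that); the feature dimensions, the random stand-in features, and the cosine-similarity choice are all hypothetical.

```python
# Minimal sketch: class prototypes from frozen source-domain features,
# nearest-prototype prediction on target-domain features.
# Illustrative only; in practice the features would come from a frozen
# backbone such as ViT or Dino-V2 rather than random tensors.
import torch
import torch.nn.functional as F


def class_prototypes(features: torch.Tensor, labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Average the source-domain features per class to form one prototype per class."""
    dim = features.shape[1]
    protos = torch.zeros(num_classes, dim)
    for c in range(num_classes):
        protos[c] = features[labels == c].mean(dim=0)
    return protos


def predict(features: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Assign each target-domain feature to its nearest prototype by cosine similarity."""
    sims = F.normalize(features, dim=1) @ F.normalize(prototypes, dim=1).T
    return sims.argmax(dim=1)


if __name__ == "__main__":
    # Stand-ins for embeddings of labelled source images and unlabelled target images.
    torch.manual_seed(0)
    num_classes, dim = 5, 768
    src_feats = torch.randn(100, dim)
    src_labels = torch.randint(0, num_classes, (100,))
    tgt_feats = torch.randn(20, dim)

    protos = class_prototypes(src_feats, src_labels, num_classes)
    print(predict(tgt_feats, protos))  # predicted class index per target sample
```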
Submission Number: 34