Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: multimodal learning, vision-language model, domain adaptation, domain generalization
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: With the help of a pre-trained vision-language model, we extend a model to new domains without any information (textual or visual) about them.
Abstract: To avoid the high cost of collecting visual data from all test domains in domain adaptation tasks, recent work takes advantage of pre-trained large-scale vision-language models and augments training data with only text descriptions (e.g., “a photo/painting/sketch...”) of each test domain. However, in many real-world applications, such text information about test domains is not always available in advance. Moreover, even if we can verbalize all test domains, existing work (Dunlap et al., 2023) must laboriously train a separate augmentation network for each possible unseen domain. To overcome these challenges, we leverage the multimodal embedding space of a pre-trained vision-language model and propose to acquire training-free, domain-invariant augmentations from text descriptions of arbitrarily crafted unseen domains, which do not necessarily match the test domains. Beyond achieving state-of-the-art results, compared with existing works that require trainable augmentation networks, our approach is notably more time-efficient and exhibits more solid theoretical support.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4755