Multi-Class Textual-Inversion Secretly Yields a Semantic-Agnostic Classifier

Published: 27 Jan 2025 · Last Modified: 06 Mar 2025 · WACV 2025 · CC BY 4.0
Abstract: With the advent of large pre-trained vision-language models such as CLIP, prompt learning methods aim to enhance the transferability of the CLIP model. They learn a prompt from a few samples of the downstream task, using the specific class names as prior knowledge, which we term semantic-aware classification. However, in many realistic scenarios, we have access only to a few samples and no knowledge of the class names. This challenging scenario represents the semantic-agnostic discriminative case. Text-to-Image (T2I) personalization methods adapt T2I models to unseen concepts by learning new tokens and endowing these tokens with the capability of generating the learned concepts. However, these methods overlook the semantic-agnostic discriminative characteristics of the newly learned tokens: the tokens do not require class names as a semantic-aware prior. In this paper, we first explore the Textual Inversion approach and reveal that, by regarding each category as a single concept, the new concept tokens possess both generation and classification capabilities. However, without constraints specific to the classification task, token updates can proceed in suboptimal directions and lead to unsatisfactory classification performance. To mitigate this issue, we propose a discriminative regularization term for the token updating process. With this technique, our method TI-SAC achieves stronger Semantic-Agnostic Classification while preserving the generation capability of the modifier tokens, given only a few samples per category. In our experiments, we extensively evaluate TI-SAC on 12 datasets covering various scenarios and demonstrate that it achieves superior results in both classification and generation.
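The core mechanism described above — treating each learned concept token as a class prototype and regularizing token updates toward discriminability — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings are random stand-ins for frozen CLIP image features and learned token embeddings, and the temperature value is an assumption in the style of CLIP, not a parameter from the paper.

```python
import numpy as np

def l2norm(x):
    """Normalize rows to unit length for cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
dim, n_classes, n_images = 16, 3, 6

# Learned concept-token embeddings, one per category (stand-ins for the
# new tokens learned by Textual Inversion; no class names are used).
token_emb = rng.normal(size=(n_classes, dim))
# Image embeddings from a frozen vision encoder (stand-ins for CLIP features).
img_emb = rng.normal(size=(n_images, dim))
labels = rng.integers(0, n_classes, size=n_images)

# Semantic-agnostic classification: cosine similarity between each image
# and each learned token; predict the most similar token.
logits = l2norm(img_emb) @ l2norm(token_emb).T   # (n_images, n_classes)
preds = logits.argmax(axis=1)

# Discriminative regularization (illustrative): a cross-entropy term that
# pushes each image's similarity toward its own class token, added to the
# generative loss during token updates.
probs = softmax(logits / 0.07)  # 0.07 is a CLIP-style temperature (assumed)
reg_loss = -np.log(probs[np.arange(n_images), labels]).mean()
```

In this sketch the regularizer's gradient with respect to `token_emb` steers token updates toward directions that separate the classes, which is the role the paper assigns to its discriminative term alongside the usual Textual Inversion objective.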