Can generative models generate novel objects the same as familiar objects?

Published: 07 May 2025 · Last Modified: 29 May 2025 · VisCon 2025 Poster · CC BY 4.0
Keywords: novel concept, novel object, text inversion, text-to-image generation, human evaluation
TL;DR: This work investigates whether text-to-image models generalize novel concepts as effectively as familiar ones by prompt tuning with randomly initialized embeddings and evaluating the generations with a human survey.
Abstract: Text-to-image generative models have demonstrated strong performance in generating realistic images, and these generations are often assumed to reflect a deep understanding of visual scenes. An interesting question is whether such models possess the zero/few-shot generalization capabilities known from humans. For example, a person can see a single example of a new object together with a word associated with it, and then use that knowledge in a highly general way to recognize or imagine the novel object in a completely different setting or context. In this work, we test the hypothesis that text-to-image models may learn familiar objects better than novel ones. We use prompt tuning to learn soft token representations for novel concepts while keeping the text-to-image model fixed, prompt tune the model on familiar concepts in the same way, and compare the generalization ability for novel versus familiar objects by running generation in different contexts/environments. In addition, instead of initializing the embedding vectors from similar concepts, we use randomly initialized embedding vectors for both familiar and novel objects. Our human-survey evaluation demonstrates that in some settings text-to-image models learn familiar objects better than novel objects.
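The prompt-tuning setup described in the abstract can be sketched in a minimal, self-contained form. This is a hypothetical illustration (not the authors' code): the pretrained token-embedding table stands in for the frozen text-to-image model, the soft token for the new concept is randomly initialized rather than copied from a similar concept, and a toy squared-error objective stands in for the actual generation loss. Only the soft token receives gradient updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained token-embedding table (stand-in for the fixed model).
vocab_embeddings = rng.normal(size=(100, 8))
frozen_before = vocab_embeddings.copy()

# Soft token for the novel concept: random init, NOT a similar-concept init.
soft_token = rng.normal(size=8)

# Toy target features the concept's embedding should reproduce
# (stand-in for the model's actual training objective).
target = rng.normal(size=8)

def loss(v):
    # Stand-in for the generative loss used in real prompt tuning.
    return float(np.sum((v - target) ** 2))

lr = 0.1
for _ in range(200):
    grad = 2.0 * (soft_token - target)  # gradient of the toy loss
    soft_token -= lr * grad             # update ONLY the soft token

# The pretrained table is untouched: the model itself stays fixed.
assert np.allclose(vocab_embeddings, frozen_before)
```

The same learned soft token would then be inserted into different prompts ("a photo of <concept> on a beach", etc.) to probe how well the concept generalizes across contexts.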
Submission Number: 31