Keywords: text-to-image generation, diffusion models, concept learning
Abstract: Text-to-image models are increasingly used in design workflows, yet articulating nuanced design intent through text alone remains difficult. This work proposes a method that extracts a visual attribute from a reference image and injects it directly into the generation pipeline. The method optimizes a text token to represent only the target attribute, using a custom training prompt and two novel embeddings: a distilled embedding and a residual embedding. This approach can extract a wide range of attributes, including the shape, material, or color of an object, as well as the camera angle of the image. The method is validated on diverse target attributes and text prompts drawn from a newly constructed dataset, and it outperforms existing approaches in selectively extracting and applying target attributes across varied contexts. Ultimately, the proposed method enables intuitive and controllable text-to-image generation, streamlining the design process.
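To make the token-optimization idea concrete, below is a minimal, self-contained PyTorch sketch of textual-inversion-style attribute learning: a single learnable token embedding is optimized, with a frozen text encoder and denoiser, to reconstruct a reference image under the standard diffusion denoising objective. This is not the authors' implementation; ToyTextEncoder, ToyDenoiser, the dimensions, the noise schedule, and the training loop are placeholder assumptions for illustration, and the sketch omits the paper's distilled and residual embeddings, whose construction is not specified in the abstract.

```python
# Minimal sketch (assumptions only, not the paper's code): learn one token
# embedding so that a prompt containing it reconstructs a reference image
# under the usual diffusion noise-prediction loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM, IMG_DIM, T_MAX = 64, 256, 1000  # toy sizes, chosen arbitrarily

class ToyTextEncoder(nn.Module):
    """Placeholder text encoder: maps a sequence of token embeddings to a pooled conditioning vector."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(EMB_DIM, EMB_DIM)
    def forward(self, token_embs):            # (seq_len, EMB_DIM)
        return self.proj(token_embs).mean(0)  # (EMB_DIM,)

class ToyDenoiser(nn.Module):
    """Placeholder conditional denoiser epsilon_theta(x_t, t, c)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(IMG_DIM + EMB_DIM + 1, 512), nn.SiLU(),
                                 nn.Linear(512, IMG_DIM))
    def forward(self, x_t, t, cond):
        t_feat = t.float() / T_MAX            # normalized timestep, shape (1,)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

torch.manual_seed(0)
text_encoder, denoiser = ToyTextEncoder(), ToyDenoiser()
for p in list(text_encoder.parameters()) + list(denoiser.parameters()):
    p.requires_grad_(False)                   # pretrained components stay frozen

# Fixed embeddings standing in for the words of a training prompt such as
# "a photo of an object with <v*>"; only the new token is trainable.
prompt_context = torch.randn(5, EMB_DIM)
attr_token = nn.Parameter(torch.randn(EMB_DIM) * 0.01)  # learnable attribute token

ref_latent = torch.randn(IMG_DIM)             # stand-in for the encoded reference image
opt = torch.optim.AdamW([attr_token], lr=1e-3)

for step in range(200):
    t = torch.randint(0, T_MAX, (1,))
    noise = torch.randn(IMG_DIM)
    alpha = 1.0 - t.float() / T_MAX           # toy linear noise schedule
    x_t = alpha.sqrt() * ref_latent + (1 - alpha).sqrt() * noise

    # Condition on the prompt with the learnable token appended.
    cond = text_encoder(torch.cat([prompt_context, attr_token.unsqueeze(0)], dim=0))
    loss = F.mse_loss(denoiser(x_t, t, cond), noise)  # standard epsilon-prediction loss

    opt.zero_grad()
    loss.backward()                           # gradients reach only attr_token
    opt.step()
```

After optimization, the learned token can in principle be placed into new prompts (e.g. a prompt describing a different object) so that generations inherit the extracted attribute, which is the behavior the abstract claims for its selective attribute transfer.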
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 22836