TL;DR: A concept-learning framework to identify atomic visual and textual concepts
Abstract: Concept learning seeks to extract semantic and interpretable representations of atomic concepts from high-dimensional data such as images and text, which can be instrumental to a variety of downstream tasks (e.g., image generation/editing). Despite its importance, the theoretical foundations for learning atomic concepts and their interactions, especially from multimodal distributions, remain underexplored.
In this work, we establish fundamental conditions for learning atomic multimodal concepts and their underlying interactions with identifiability guarantees. We formulate concept learning as a latent variable identification problem, representing the atomic concepts in each modality as latent variables and specifying their cross-modal interactions with a graphical model. Our main theoretical contribution is component-wise identifiability of atomic concepts under flexible, nonparametric conditions that accommodate both continuous and discrete modalities. Building on these theoretical insights, we demonstrate the practical utility of our theory in a downstream task: text-to-image (T2I) generation. We develop a principled T2I model that explicitly learns atomic textual and visual concepts with sparse connections between them, enabling image generation and editing at the level of individual atomic concepts. Empirical evaluations show that our model outperforms existing methods on T2I generation tasks, offering superior controllability and interpretability.
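To make the formulation concrete, a minimal sketch of the kind of latent-variable model the abstract describes is given below. The notation (textual concepts z^t, visual concepts z^v, decoders f_t and f_x, adjacency matrix A) is illustrative and assumed for this sketch, not necessarily the paper's own.

```latex
% A hedged sketch of a cross-modal latent-variable formulation with sparse
% connections; symbols are illustrative, not the paper's exact notation.
\begin{align*}
  z^t &= (z^t_1, \dots, z^t_m) \sim p(z^t)
      && \text{atomic textual concepts} \\
  z^v_j &\sim p\!\left(z^v_j \mid \{\, z^t_i : A_{ij} = 1 \,\}\right),
      \quad j = 1, \dots, n
      && \text{atomic visual concepts, sparse parent set } A \in \{0,1\}^{m \times n} \\
  t &= f_t(z^t), \qquad x = f_x(z^v)
      && \text{nonparametric decoders for text and image}
\end{align*}
```

Under such a reading, component-wise identifiability would mean that each learned concept corresponds to a single true z^t_i or z^v_j up to an invertible transformation, so that editing one concept changes only its associated attribute in the generated image.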
Lay Summary: Consider a common use case in text-to-image generation: a user provides a text prompt to generate an image, then wishes to make minor edits, such as changing only the color of the clothing. Controllable generation enables the user to modify the prompt accordingly, prompting the model to adjust the specified feature while preserving all other aspects of the image. This ability to make targeted changes without unintended alterations underscores the importance of controllable text-to-image generation.
In this study, the authors provide a solid theoretical foundation for learning atomic vision and language concepts and for understanding how they relate to each other. They treat concept learning as a hidden-structure problem, modeling each atomic concept as an unobserved (latent) variable. Their theory shows that each concept can be identified individually, even when the data types are varied and complex, without imposing restrictive assumptions.
They then apply their theory to improve a practical task: generating images from text descriptions. They introduce a new model that learns clear, sparse connections between text and image concepts. Tests show that this model produces better, more controllable results than existing methods.
Primary Area: General Machine Learning->Representation Learning
Keywords: Concept; Identifiability; Controllable text-to-image
Submission Number: 7503