TL;DR: We propose CHATS, a novel generative framework that facilitates the collaboration between human preference alignment and test-time sampling.
Abstract: Diffusion models have emerged as a dominant approach for text-to-image generation. Key components such as human preference alignment and classifier-free guidance play a crucial role in ensuring generation quality. However, applied independently, as in current text-to-image models, they still fall short of strong text-image alignment, high generation quality, and consistency with human aesthetic standards. In this work, we explore, for the first time, facilitating the collaboration of human preference alignment and test-time sampling to unlock the potential of text-to-image models. To this end, we introduce CHATS (Combining Human-Aligned optimization and Test-time Sampling), a novel generative framework that separately models the preferred and dispreferred distributions and employs a proxy-prompt-based sampling strategy to exploit the useful information contained in both. We observe that CHATS is exceptionally data-efficient, achieving strong performance with only a small, high-quality fine-tuning dataset. Extensive experiments demonstrate that CHATS surpasses traditional preference-alignment methods, setting a new state of the art across various standard benchmarks. The code is publicly available at github.com/AIDC-AI/CHATS.
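To make the sampling strategy concrete, below is a minimal sketch of one plausible guided denoising step in the spirit of classifier-free guidance: the conditional predictions of the separately trained preferred and dispreferred models are contrasted to steer sampling toward the preferred distribution and away from the dispreferred one. All names (`guided_noise_prediction`, `guidance_scale`, the tensor shapes) and the exact weighting are illustrative assumptions, not the paper's formulation; consult the released code for the actual method.

```python
import torch

def guided_noise_prediction(
    eps_pref_cond: torch.Tensor,     # preferred-distribution model, conditioned on the prompt
    eps_dispref_cond: torch.Tensor,  # dispreferred-distribution model, conditioned on the prompt
    eps_uncond: torch.Tensor,        # unconditional (null-prompt) prediction
    guidance_scale: float = 5.0,
) -> torch.Tensor:
    """One hypothetical CFG-style combination: push the sample toward what
    the preferred model predicts and away from the dispreferred model."""
    return eps_uncond + guidance_scale * (eps_pref_cond - eps_dispref_cond)

# Toy usage with random tensors standing in for the denoisers' outputs.
if __name__ == "__main__":
    shape = (1, 4, 64, 64)  # a typical latent shape for a latent diffusion model
    eps_p, eps_d, eps_u = (torch.randn(shape) for _ in range(3))
    eps = guided_noise_prediction(eps_p, eps_d, eps_u, guidance_scale=5.0)
    print(eps.shape)  # torch.Size([1, 4, 64, 64])
```

In a full sampler this guided prediction would replace the single-model CFG output at each denoising step; how CHATS weights the two distributions and constructs the proxy prompt is specified in the paper and code release.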
Lay Summary: Imagine having two friendly artists in your computer: one whose only job is to remember what beautiful, on-point images look like, and another who spots the mistakes and odd quirks you definitely don't want. You feed them around 7,500 pairs of "I love this" and "please don't show this" examples, and each learns its craft. When you type in a prompt, a clever mixer called the proxy-prompt sampler blends their advice, nudging the final picture toward the good stuff and steering clear of the bad. The result? Lifelike, clear, and creative images that match your words and your taste, without complicated tuning or massive data. We tested this on popular benchmarks for aesthetics, object counting, and scene detail, and it outperforms the usual methods. With CHATS, turning simple text into stunning visuals feels as natural as sketching a quick doodle, bringing your ideas to life in just a few seconds.
Link To Code: https://github.com/AIDC-AI/CHATS
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Diffusion Models, Flow Matching Models, Human Preference Optimization, Classifier-Free Guidance, Text-to-Image Generation
Submission Number: 1411