Auth-Prompt Bench: Towards Reliable and Stable Prompting in Text-to-Image Generation

ICLR 2026 Conference Submission16248 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: text-to-image dataset and benchmark, large language models, prompt optimization
Abstract: Recent advances in diffusion models and large language models (LLMs) have enabled text-to-image generation of remarkable fidelity, allowing users to synthesize images directly from natural language prompts. However, we observe that prompt informativeness varies significantly across user proficiency levels, a phenomenon that has drawn little research attention. Ambiguous or under-specified prompts often lead to unstable outputs that deviate from user intent, while current benchmarks provide limited means to quantify this phenomenon. We address this gap by introducing the Authentic Prompt Benchmark (Auth-Prompt Bench), a large-scale benchmark of 17,580 prompt–image pairs from both novice and expert users, sourced from authentic web cases and specifically designed to evaluate prompting stability in text-to-image generation. Unlike existing metrics that focus solely on prompt–image alignment, Auth-Prompt Bench is grounded in an information-theoretic perspective of prompt-to-prompt transmission, enabling stability assessment through three complementary metrics: mutual information, prompt entropy, and prompt energy. Building on these insights, we propose NoxEye, an end-to-end prompt optimization framework comprising (i) an information enhancer that maps user prompts toward the model-preferred distribution, and (ii) an information aligner that enforces fine-grained alignment of visual entities. Across Auth-Prompt Bench and other established benchmarks, NoxEye improves over state-of-the-art baselines by up to 13.52\% in mutual information, 20.30\% in prompt entropy, and 27.01\% in prompt energy, while substantially enhancing prompts from novice users. Our results establish Auth-Prompt Bench as one of the first dedicated benchmarks for stability in T2I generation and demonstrate that information-theoretic prompt optimization can significantly enhance both robustness and fidelity.
Human evaluation further verifies the efficacy of our method in aligning T2I generation with user intent. We hope this work provides the community with a foundation for principled evaluation and reliable user–model interaction in T2I generative systems. The source code and dataset will be made publicly available.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16248