Abstract: Text-to-image models excel at producing
high-quality images from text prompts, but
crafting effective prompts often requires
specialized vocabulary. To address this,
existing methods train rewriting models with
supervision from large amounts of manually
annotated data and trained aesthetic assessment
models. To alleviate the dependence on data
scale for model training and the biases intro-
duced by trained models, we propose a novel
prompt optimization framework that rephrases
a simple user prompt into a sophisticated
prompt for a text-to-image model. Specifically,
we employ large vision language models (LVLMs)
as the solver to rewrite the user prompt and,
concurrently, employ LVLMs as a
reward model to score the aesthetics and align-
ment of the images generated by the optimized
prompt. Instead of laborious human feedback,
we exploit the prior knowledge of the LVLM
to provide rewards, i.e., AI feedback. Furthermore,
the solver and the reward model are unified
into a single model that is trained iteratively
with reinforcement learning, achieving
self-improvement by proposing a solution and
then judging its own output. Results
on two popular datasets demonstrate that our
method outperforms other strong competitors.
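
As a rough illustration of the solve-then-judge loop described above, the following minimal Python sketch samples several rewrites of a user prompt, scores the resulting images, and ranks the candidates. It is not the paper's implementation: `self_rewarding_step`, `preference_pair`, and the `toy_*` functions are hypothetical names, the toy functions are deterministic stand-ins for the LVLM and the text-to-image model, and the reward formula is an arbitrary placeholder for the aesthetics and alignment scores. In the framework described here, the rewriting and judging roles would both be played by the same LVLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    prompt: str    # rewritten prompt proposed by the solver
    reward: float  # score assigned by the judge to the generated image


def self_rewarding_step(
    user_prompt: str,
    rewrite: Callable[[str, int], str],   # solver role (hypothetical interface)
    generate: Callable[[str], str],       # text-to-image model (hypothetical interface)
    judge: Callable[[str, str], float],   # reward role (hypothetical interface)
    num_candidates: int = 4,
) -> List[Candidate]:
    """Sample rewrites, score their images, and return them best-first."""
    candidates = []
    for seed in range(num_candidates):
        prompt = rewrite(user_prompt, seed)
        image = generate(prompt)
        candidates.append(Candidate(prompt, judge(user_prompt, image)))
    return sorted(candidates, key=lambda c: c.reward, reverse=True)


def preference_pair(ranked: List[Candidate]) -> Tuple[Candidate, Candidate]:
    """Return (best, worst); such a pair could feed a preference-based update.

    Assumption: the abstract does not say which RL algorithm is used.
    """
    return ranked[0], ranked[-1]


# ---- Toy stand-ins (assumptions, for demonstration only) ----
STYLE_WORDS = ["highly detailed", "soft lighting", "wide angle", "oil painting"]


def toy_rewrite(user_prompt: str, seed: int) -> str:
    # Append `seed` style modifiers to the user prompt.
    return ", ".join([user_prompt] + STYLE_WORDS[:seed])


def toy_generate(prompt: str) -> str:
    # Placeholder "image": simply echoes the prompt text.
    return prompt


def toy_judge(user_prompt: str, image: str) -> float:
    # Placeholder reward: alignment (original prompt preserved) plus a crude
    # "aesthetics" proxy that counts the added modifiers.
    alignment = 1.0 if user_prompt in image else 0.0
    aesthetics = 0.5 * image.count(",")
    return alignment + aesthetics


if __name__ == "__main__":
    ranked = self_rewarding_step("a cat", toy_rewrite, toy_generate, toy_judge)
    best, worst = preference_pair(ranked)
    print("preferred:", best.prompt, best.reward)
    print("rejected: ", worst.prompt, worst.reward)
```

Passing the solver and the judge as separate callables keeps the sketch testable with stubs; a real system would route both calls to one LVLM and use the ranked candidates as the training signal.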