Abstract: Recent Text-to-Image (T2I) generation models such as
Stable Diffusion and Imagen have made significant progress
in generating high-resolution images based on text descriptions. However, many generated images still suffer from
issues such as artifacts/implausibility, misalignment with
text descriptions, and low aesthetic quality. Inspired by the
success of Reinforcement Learning with Human Feedback
(RLHF) for large language models, prior works collected
human-provided scores as feedback on generated images
and trained a reward model to improve T2I generation.
In this paper, we enrich the feedback signal by (i) marking
image regions that are implausible or misaligned with the
text, and (ii) annotating which words in the text prompt are
misrepresented or missing from the image. We collect such
rich human feedback on 18K generated images (RichHF-18K) and train a multimodal transformer to predict the rich
feedback automatically. We show that the predicted rich
human feedback can be leveraged to improve image generation, for example, by selecting high-quality training data
to finetune and improve the generative models, or by creating masks with predicted heatmaps to inpaint the problematic regions. Notably, the improvements generalize to
models (Muse) beyond those used to generate the images
on which human feedback data were collected (Stable Diffusion variants). The RichHF-18K data set will be released in our GitHub repository: https://github.com/google-research/google-research/tree/master/richhf_18k.