Multi-Instance Text-to-Image Generation via Instance Disentanglement and Reinforcement Learning

Bo Li, Fengxiang Yang

Published: 09 Nov 2024, Last Modified: 11 Apr 2025, OpenReview Archive Direct Upload, Everyone, CC BY 4.0

Abstract: In this paper, we study the multi-instance generation (MIG) problem: simultaneously generating multiple instances within an image according to the given captions and layout information. The main challenges of MIG are attribute leakage and generalization to novel scenes, which are caused by weak text features and a lack of unbiased training data. We therefore present IDCRL, which jointly uses an instance disentanglement constraint (IDC) and reinforcement learning (RL) to address these challenges. IDC deploys instance disentanglement modules that exploit mutual synergy to obtain task-related features. The disentangling process is supervised by requiring task-related features to trigger more accurate cross-attention maps than task-unrelated ones, which forces the former to retain the information of all instances and thus avoids attribute leakage. Moreover, to improve MIG generalization, we introduce reinforcement learning with the aid of a critic-net and a reward model. The reward model is a visual language model that offers unbiased rewards to facilitate the collaboration between the critic-net and the MIG model. This training strategy enables the MIG model to seek the best generation policy and thus improves its MIG capability. Lastly, the two proposed components can be integrated in a coherent manner, further improving generation capability. Extensive experiments on large-scale MIG benchmarks demonstrate the efficacy of our method.
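
To make the supervision signal described above more concrete, the following is a minimal sketch of one way a constraint of the form "task-related features should trigger more accurate cross-attention maps than task-unrelated ones" could be written as a margin loss. It is not the formulation used in the paper. The accuracy measure (share of attention mass inside the instance's layout mask), the hinge form, the `margin` value, and the names `attention_accuracy` and `disentangle_margin_loss` are all assumptions introduced for illustration.

```python
from typing import Sequence


def attention_accuracy(attn: Sequence[float], mask: Sequence[int]) -> float:
    """Share of attention mass that falls inside the instance's layout mask.

    Assumed accuracy measure: `attn` is a flattened, non-negative
    cross-attention map and `mask` is a flattened binary layout mask.
    """
    total = sum(attn)
    if total <= 0:
        return 0.0
    inside = sum(a for a, m in zip(attn, mask) if m)
    return inside / total


def disentangle_margin_loss(
    attn_related: Sequence[float],
    attn_unrelated: Sequence[float],
    mask: Sequence[int],
    margin: float = 0.2,  # hypothetical hyperparameter
) -> float:
    """Hinge penalty that is zero once the map triggered by task-related
    features beats the task-unrelated one by at least `margin`."""
    gap = attention_accuracy(attn_related, mask) - attention_accuracy(
        attn_unrelated, mask
    )
    return max(0.0, margin - gap)


# Toy 2x2 image, flattened; the instance occupies the first two cells.
mask = [1, 1, 0, 0]
focused = [0.4, 0.4, 0.1, 0.1]      # most attention inside the box
diffuse = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly

loss_good = disentangle_margin_loss(focused, diffuse, mask)
loss_bad = disentangle_margin_loss(diffuse, focused, mask)
```

In this toy case the loss vanishes when the task-related map is the focused one and is positive when the roles are swapped, which is the ordering the constraint is meant to enforce.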