Keywords: Reward model; Reinforcement learning; Reasoning model.
Abstract: While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored.
A key bottleneck is the lack of a robust general reward model for all editing tasks.
Existing editing reward models typically produce a single overall score without fine-grained checks, ignoring the distinct requirements of different instructions and yielding biased rewards.
To address this, we propose Edit-R1, which boosts image editing models with a chain-of-thought (CoT) reasoning reward model (RRM).
This Edit-RRM decomposes each instruction into verifiable principles, evaluates the edited image against each principle, and aggregates the fine-grained scores, which reduces hallucinations and yields more interpretable evaluation criteria.
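As a loose illustration only (not the paper's actual implementation), the sketch below shows how per-principle scores might be aggregated into a single scalar reward; the data structure, the example principles, and the uniform-average aggregation are all assumptions for illustration.

```python
# Hypothetical sketch of principle-based reward aggregation; the paper's actual
# principle extraction and aggregation rule are not specified in the abstract.
from dataclasses import dataclass
from typing import List

@dataclass
class PrincipleScore:
    principle: str   # a verifiable criterion derived from the edit instruction
    score: float     # per-principle judgment in [0, 1]

def aggregate_reward(principle_scores: List[PrincipleScore]) -> float:
    """Aggregate fine-grained, per-principle scores into one scalar reward."""
    if not principle_scores:
        return 0.0
    # Assumed aggregation: uniform average over principles.
    return sum(p.score for p in principle_scores) / len(principle_scores)

# Example principles a CoT judge might derive from
# "Replace the red car with a blue bicycle, keep the background unchanged."
scores = [
    PrincipleScore("The red car is removed", 1.0),
    PrincipleScore("A blue bicycle appears at the same location", 0.5),
    PrincipleScore("The background is unchanged", 1.0),
]
print(aggregate_reward(scores))  # ~0.83
```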
To build such an RRM, we first apply supervised fine-tuning (SFT) as a “cold start” so that the model learns to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM.
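The abstract does not specify GCPO's objective; the hypothetical sketch below only illustrates how human pairwise preferences could be turned into scalar rewards for a pointwise RRM, with `rrm_score` and the ±1 ranking reward introduced purely as assumptions.

```python
# Hypothetical sketch: converting human pairwise preferences into rewards for a
# pointwise reasoning reward model (RRM). This is not GCPO itself, only the kind
# of ranking signal such training could use.
from typing import Callable, List, Tuple

def ranking_rewards(
    rrm_score: Callable[[str, str], float],  # (instruction, image_id) -> pointwise score
    instruction: str,
    pairs: List[Tuple[str, str]],            # (preferred_image, dispreferred_image)
) -> List[float]:
    """Return +1 when the RRM's pointwise scores agree with the human preference,
    -1 otherwise; these scalars can then drive a group-relative policy update."""
    rewards = []
    for preferred, dispreferred in pairs:
        agrees = rrm_score(instruction, preferred) > rrm_score(instruction, dispreferred)
        rewards.append(1.0 if agrees else -1.0)
    return rewards

# Toy usage with a stand-in scoring function.
toy_scores = {"img_a": 0.9, "img_b": 0.4}
print(ranking_rewards(lambda _, img: toy_scores[img], "add a hat", [("img_a", "img_b")]))  # [1.0]
```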
After building the RRM, we use GRPO (Group Relative Policy Optimization) to train editing models with this non-differentiable yet powerful reward model.
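For context, a minimal sketch of the group-relative advantage commonly used in GRPO-style training is shown below; because the reward model only needs to return a scalar per sampled edit, it can remain non-differentiable. The function name and the toy rewards are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of group-relative advantages: each candidate edit sampled for one
# prompt is scored by the (non-differentiable) reward model, and advantages are the
# rewards normalized within that group. The policy update itself is omitted.
import statistics
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize scalar rewards within a group of samples drawn for one prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: rewards from Edit-RRM for a group of candidate edits of one instruction.
print(group_relative_advantages([0.2, 0.8, 0.5, 0.9]))
```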
Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and it exhibits a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models such as FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.
Primary Area: reinforcement learning
Submission Number: 1383