Keywords: Benchmark; Image Editing; Diffusion Model; Reinforcement Learning
TL;DR: We extend instruction-guided editing from single- to dual-reference with a two-stage framework: fine-tuning for multi-image fusion and RL alignment for instruction fidelity, enabling composition, insertion, and transfer beyond open-source baselines.
Abstract: Most instruction-guided image editing models assume a single reference image. However, many real-world tasks—such as combining people into a group portrait, integrating a subject into a scene, or transferring clothing between individuals—require reasoning across multiple inputs. Current approaches either fail outright or rely on ad-hoc heuristics to merge results.
In this work, we present a systematic study of dual-image instruction-guided editing. To support this, we construct a synthesized dataset of dual-image instructions spanning five representative categories: animal–scene composition, person–scene insertion, group portraits, style transfer, and clothing replacement. Building on an open-source single-image reference editing model, we introduce a dual positional embedding scheme with LoRA fine-tuning that enables efficient multi-reference fusion without catastrophic forgetting. Furthermore, we apply reinforcement alignment with Denoising Diffusion Policy Optimization (DDPO), using a vision-language model as the reward model to better align generations with editing instructions. Despite being trained on relatively small-scale data, our method achieves strong qualitative and quantitative improvements, surpassing existing open-source baselines in multi-reference editing.
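To make the reinforcement-alignment step concrete, below is a minimal sketch (not the authors' code) of a DDPO-style clipped objective: scalar rewards from a vision-language model judge are normalized into advantages and used to weight an importance-ratio loss over the denoising trajectory. All names here (`ddpo_clipped_loss`, `vlm_scores`, the tensor shapes) are illustrative assumptions, not the paper's actual interfaces.

```python
import torch

def ddpo_clipped_loss(log_probs_new, log_probs_old, rewards, clip_eps=1e-4):
    """log_probs_*: (batch, T) log p(x_{t-1} | x_t, c) over T denoising steps.
    rewards: (batch,) scalar scores, e.g. a VLM's instruction-fidelity rating."""
    # Normalize rewards into advantages (per batch here; typically per prompt).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv = adv.unsqueeze(1)  # broadcast the advantage over denoising steps
    # Importance ratio between current and data-collecting policy.
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # Pessimistic (clipped) policy-gradient objective, negated for minimization.
    return -torch.min(unclipped, clipped).mean()

# Toy usage with random tensors standing in for real trajectories and VLM scores.
lp_new = torch.randn(4, 50, requires_grad=True)
lp_old = lp_new.detach() + 0.01 * torch.randn(4, 50)
vlm_scores = torch.tensor([0.8, 0.3, 0.6, 0.9])  # hypothetical VLM judgments
loss = ddpo_clipped_loss(lp_new, lp_old, vlm_scores)
loss.backward()
```

In this sketch the VLM simply supplies the scalar reward; how the paper prompts or calibrates that judge is not specified here and would follow the authors' own setup.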
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15849