We propose an unsupervised model for instruction-based image editing that eliminates the need for ground-truth edited images during training. Traditional supervised approaches depend on datasets of triplets (input image, edited image, edit instruction), typically generated either by existing editing methods or through human annotation, which introduces biases and limits generalization. Our model addresses these challenges with a novel editing mechanism called Cycle Edit Consistency (CEC): we apply a forward and a backward edit in a single training step and enforce consistency in both image and attention space. This bypasses the need for ground-truth edited images and unlocks training on datasets comprising either real image-caption pairs or image-caption-edit triplets. We empirically show that our unsupervised method achieves stronger performance across a wider range of edits, with high fidelity and precision. By eliminating the need for pre-existing triplet datasets, reducing the biases associated with supervised methods, and introducing CEC, our work is a significant step toward scaling instruction-based image editing.
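To make the cycle mechanism concrete, below is a minimal PyTorch sketch of a CEC-style training step under stated assumptions: `ToyEditor`, `cec_loss`, the single-head attention map, and the L1/MSE distances are hypothetical stand-ins for illustration, not the paper's architecture or losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEditor(nn.Module):
    """Stand-in for an instruction-conditioned editing model. It returns an
    edited image together with a spatial attention map; the architecture and
    all names here are illustrative assumptions, not the paper's model."""
    def __init__(self, channels=3, instr_dim=16):
        super().__init__()
        self.cond_proj = nn.Linear(instr_dim, channels)  # instruction -> feature bias
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.attn_head = nn.Conv2d(channels, 1, 1)       # where-to-edit map

    def forward(self, image, instr_emb):
        cond = self.cond_proj(instr_emb)[:, :, None, None]
        h = self.conv(image) + cond
        attn = torch.sigmoid(self.attn_head(h))          # [B, 1, H, W]
        edited = image + attn * torch.tanh(h)            # modify only attended regions
        return edited, attn

def cec_loss(editor, image, fwd_instr, bwd_instr):
    """One CEC-style step: edit forward with the instruction, edit back with
    its inverse, and enforce consistency in image and attention space.
    The distance choices and equal weighting are assumptions."""
    edited, attn_fwd = editor(image, fwd_instr)          # forward edit
    recon, attn_bwd = editor(edited, bwd_instr)          # backward (inverse) edit
    loss_image = F.l1_loss(recon, image)                 # round trip recovers input
    loss_attn = F.mse_loss(attn_bwd, attn_fwd.detach())  # both edits target same regions
    return loss_image + loss_attn

# Toy usage with random inputs; in practice the instruction embeddings would
# come from a text encoder (e.g. "add a hat" / "remove the hat").
editor = ToyEditor()
image = torch.rand(2, 3, 64, 64)
fwd, bwd = torch.randn(2, 16), torch.randn(2, 16)
loss = cec_loss(editor, image, fwd, bwd)
loss.backward()
```

Note the training signal comes entirely from the input image and the instruction pair; no ground-truth edited image appears anywhere in the loss. Detaching the forward attention map is one plausible design choice to keep the backward edit chasing the forward one rather than both collapsing.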