Instruction-based image editing methods offer user-friendly control through natural-language commands. However, without a user-provided mask, existing methods cannot identify and edit a specific object when multiple similar instances exist, as in \textit{``add the man on the right a hat''}. Furthermore, the iterative nature of the editing process often introduces ambiguous references from users, such as \textit{``change it to blue''}, which are difficult to resolve without contextual understanding. Multimodal large language models (MLLMs) offer strong cross-modal comprehension and co-reference resolution capabilities. In this work, we present \emph{ReferPix2Pix}, which leverages MLLMs to interpret editing instructions and provide regions of interest (RoI) for precise editing. Such pixel-grounded guidance from MLLMs improves the comprehension of referring expressions, resolves ambiguous references, and enables localized edits by the editing model. In addition, we develop the CoReferEdit benchmark to evaluate editing capabilities across iterative editing phases with multimodal co-references. Comprehensive experiments show that our approach significantly improves performance on referring and co-referential editing tasks. Our code and data will be made publicly available\footnote{Please refer to the \href{https://anonymous.4open.science/r/ReferPix2Pix}{anonymous webpage} for code and qualitative results.}.
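The following is a minimal sketch of the two-stage idea summarized above: an MLLM first resolves the (possibly ambiguous) instruction against the image and dialogue history into an explicit instruction plus an RoI mask, and a mask-conditioned editing model then applies the edit only inside that region. All object interfaces, method names, and prompts below are hypothetical illustrations under these assumptions, not the paper's actual implementation.
\begin{verbatim}
from typing import List, Tuple
import numpy as np

def resolve_instruction(mllm, image: np.ndarray, instruction: str,
                        history: List[str]) -> Tuple[str, np.ndarray]:
    """Ask a (hypothetical) grounding-capable MLLM to rewrite an ambiguous
    instruction, e.g. "change it to blue", into an explicit one and return
    a binary RoI mask for the referred object."""
    explicit_instruction, roi_mask = mllm.ground(
        image=image, instruction=instruction, dialogue_history=history)
    return explicit_instruction, roi_mask

def apply_edit(editor, image: np.ndarray, instruction: str,
               roi_mask: np.ndarray) -> np.ndarray:
    """Apply a (hypothetical) mask-conditioned editing model only inside
    the RoI, leaving the rest of the image untouched."""
    return editor.edit(image=image, prompt=instruction, mask=roi_mask)

def iterative_editing(mllm, editor, image: np.ndarray,
                      instructions: List[str]) -> np.ndarray:
    """Run multi-turn editing, carrying the dialogue history so that
    co-references in later turns can be resolved."""
    history: List[str] = []
    for turn in instructions:  # e.g. ["add the man on the right a hat",
                               #       "change it to blue"]
        prompt, mask = resolve_instruction(mllm, image, turn, history)
        image = apply_edit(editor, image, prompt, mask)
        history.append(turn)
    return image
\end{verbatim}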