Instruction-based image editing methods offer user-friendly control through natural-language commands. However, without a user-provided mask, existing methods cannot identify and edit a specific object when multiple similar instances exist, as in \textit{``add the man on the right a hat''}. Furthermore, the iterative nature of the editing process often introduces ambiguous references from users, such as \textit{``change it to blue''}, which are difficult to resolve without contextual understanding. Multimodal large language models (MLLMs) offer strong cross-modal comprehension and co-reference resolution capabilities. In this work, we present \emph{ReferPix2Pix}, which leverages MLLMs to interpret editing instructions and provide regions of interest (RoI) for precise editing. Such pixel-grounded guidance from MLLMs improves the comprehension of referring expressions, resolves ambiguous references, and enables localized edits by the editing model. In addition, we develop the CoReferEdit benchmark to evaluate editing capabilities across iterative editing phases with multimodal co-references. Comprehensive experiments show that our approach significantly improves performance on referring and co-referential editing tasks. Our code and data will be made publicly available\footnote{Please refer to the \href{https://anonymous.4open.science/r/ReferPix2Pix}{anonymous webpage} for code and qualitative results.}.
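The following is a minimal sketch of the two-stage idea summarized above: an MLLM first resolves the (possibly ambiguous) instruction against the image and dialogue history into an explicit instruction plus an RoI mask, and a mask-conditioned editing model then applies the edit only inside that region. All object interfaces, method names, and prompts below are hypothetical illustrations under these assumptions, not the paper's actual implementation.
\begin{verbatim}
from typing import List, Tuple
import numpy as np

def resolve_instruction(mllm, image: np.ndarray, instruction: str,
                        history: List[str]) -> Tuple[str, np.ndarray]:
    """Ask a (hypothetical) grounding-capable MLLM to rewrite an ambiguous
    instruction, e.g. "change it to blue", into an explicit one and return
    a binary RoI mask for the referred object."""
    explicit_instruction, roi_mask = mllm.ground(
        image=image, instruction=instruction, dialogue_history=history)
    return explicit_instruction, roi_mask

def apply_edit(editor, image: np.ndarray, instruction: str,
               roi_mask: np.ndarray) -> np.ndarray:
    """Apply a (hypothetical) mask-conditioned editing model only inside
    the RoI, leaving the rest of the image untouched."""
    return editor.edit(image=image, prompt=instruction, mask=roi_mask)

def iterative_editing(mllm, editor, image: np.ndarray,
                      instructions: List[str]) -> np.ndarray:
    """Run multi-turn editing, carrying the dialogue history so that
    co-references in later turns can be resolved."""
    history: List[str] = []
    for turn in instructions:  # e.g. ["add the man on the right a hat",
                               #       "change it to blue"]
        prompt, mask = resolve_instruction(mllm, image, turn, history)
        image = apply_edit(editor, image, prompt, mask)
        history.append(turn)
    return image
\end{verbatim}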