ReMix: Towards a Unified View of Consistent Character Generation and Editing

12 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Diffusion Model; Consistent Character Generation; Image Editing
TL;DR: A unified approach to character-consistent generation and image editing
Abstract: Consistent character generation and editing have made significant strides in recent years, driven by advances in large-scale text-to-image diffusion models (e.g., FLUX.1) that produce high-fidelity outputs. Yet few methods effectively unify the two tasks within a single framework. Generation-based methods still struggle to enforce fine-grained consistency, especially when tracking multiple instances, whereas editing-based approaches often struggle to preserve posture flexibility and to follow instructions. To address this gap, we propose **ReMix**, a unified framework for character-consistent generation and editing. It consists of two main components: the ReMix Module and IP-ControlNet. The ReMix Module leverages the multimodal understanding capabilities of an MLLM to edit the semantic content of the input image and adapts the instruction features to be compatible with a native DiT backbone. While semantic editing can ensure a coherent semantic layout, it cannot guarantee pixel-level consistency or controllable posture. To this end, we introduce IP-ControlNet to address these problems. Specifically, inspired by convergent evolution in biology and by decoherence in quantum systems, where environmental noise induces state convergence, we hypothesize that jointly denoising the reference and target images within the same noise space promotes feature convergence, thereby aligning their hidden feature spaces. Architecturally, we therefore extend ControlNet not only to handle sparse signals but also to decouple semantic and layout features from reference images as input. For optimization, we establish an ε-equivariant latent space that allows visual conditions to share a common noise space with the target image at each diffusion timestep. We observe that this alignment facilitates consistent object generation while faithfully preserving reference character identities. Through this design, ReMix supports a wide range of visual-guidance tasks, including personalized generation, image editing, style transfer, and multi-visual-condition generation, among others. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our proposed unified framework and optimization theory.
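To make the shared-noise-space idea concrete, the following is a minimal sketch, not the authors' implementation, of one joint-denoising training step under standard ε-prediction diffusion training: the reference and target latents are perturbed with the *same* noise sample ε at the *same* timestep, so both conditions live in a common noise space. The names `model`, `x_ref`, `x_tgt`, and `alphas_cumprod` are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def joint_denoising_step(x_ref, x_tgt, t, alphas_cumprod, model):
    """Illustrative sketch only: share one noise sample between the reference
    and target latents at timestep t, then ask a joint denoiser to recover it.
    `model` is a hypothetical denoiser taking (noisy_ref, noisy_tgt, t)."""
    eps = torch.randn_like(x_tgt)                       # shared noise sample
    a_t = alphas_cumprod[t].sqrt()                      # signal scale at t
    s_t = (1.0 - alphas_cumprod[t]).sqrt()              # noise scale at t
    noisy_ref = a_t * x_ref + s_t * eps                 # same eps for reference
    noisy_tgt = a_t * x_tgt + s_t * eps                 # and for target
    eps_pred = model(noisy_ref, noisy_tgt, t)           # predict the shared noise
    return F.mse_loss(eps_pred, eps)                    # standard eps-prediction loss
```

In this reading, the common ε is what couples the two latents at every timestep; the actual ReMix/IP-ControlNet architecture and conditioning details are described only at a high level in the abstract.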
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4367