GG-Editor: Locally Editing 3D Avatars with Multimodal Large Language Model Guidance

Published: 20 Jul 2024 | Last Modified: 06 Aug 2024 | MM 2024 Poster | License: CC BY 4.0
Abstract: Text-driven 3D avatar customization has attracted increasing attention in recent years, and precisely editing specific local parts of an avatar with only text prompts remains particularly challenging. Previous editing methods usually use segmentation or cross-attention masks as constraints for local editing. Although these masks tightly cover existing objects/parts, they may prevent editing methods from creating drastic geometry deformations that extend beyond the covered content. From a different perspective, this paper presents a GPT-guided local avatar editing framework, namely GG-Editor. Specifically, GG-Editor progressively mines more reasonable candidate editing regions by harnessing multimodal large language models, which already organically assimilate common-sense human knowledge. To improve the editing quality of local areas, GG-Editor explicitly decouples geometry and appearance optimization and adopts a global-local synergy editing strategy with GPT-generated local prompts. Moreover, to preserve concepts residing in source avatars, GG-Editor proposes an orthogonal denoising score that orthogonally decomposes editing directions and introduces an explicit term for preservation. Comprehensive experiments demonstrate that GG-Editor, with only textual prompts, achieves realistic and high-fidelity local editing results, significantly surpassing prior works. Project page: https://xuyunqiu.github.io/GG-Editor/.
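To make the "orthogonal denoising score" idea more concrete, below is a minimal, hypothetical sketch of what an orthogonal split of editing directions with an explicit preservation term could look like. The names (g_edit, g_src, preserve_weight) and the specific weighting scheme are illustrative assumptions, not the paper's actual formulation.

```python
import torch

def orthogonal_edit_direction(g_edit: torch.Tensor,
                              g_src: torch.Tensor,
                              preserve_weight: float = 0.5) -> torch.Tensor:
    """Illustrative sketch (not the paper's exact method).

    g_edit: score/gradient pulling the avatar toward the editing prompt
    g_src:  score/gradient pulling the avatar toward the source concept
    The edit direction is decomposed into a component parallel to g_src
    and a component orthogonal to it; the parallel part is rescaled by
    `preserve_weight` so the source concept is explicitly retained.
    """
    e = g_edit.flatten()
    s = g_src.flatten()
    # Project the edit direction onto the source-preservation direction.
    parallel = (e @ s) / (s @ s + 1e-8) * s
    orthogonal = e - parallel
    # Keep the full orthogonal (novel) component; damp the parallel one.
    combined = orthogonal + preserve_weight * parallel
    return combined.view_as(g_edit)
```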
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Experience] Multimedia Applications, [Experience] Art and Culture, [Generation] Multimedia Foundation Models
Relevance To Conference: This work investigates locally editing 3D avatars with only textual guidance. For this challenging task, it offers fresh insights and presents three technical contributions. First, we introduce a novel multimodal LLM-guided framework that integrates common-sense human knowledge and progressively mines reasonable candidate regions for local editing. Second, we devise an effective global-local view synergy editing strategy that improves local editing results by training models with additional local renderings and GPT-generated local prompts. Third, we present a new orthogonal denoising score that orthogonally decomposes the editing directions and introduces an explicit term to adjust how strongly the source concept is preserved. With only textual prompts, our method achieves realistic and high-fidelity local editing results, significantly boosting multimodal 3D avatar local editing performance.
Supplementary Material: zip
Submission Number: 26