Abstract: Human image generation has achieved remarkable success with diffusion models in recent years, yet these models still struggle to generate realistic, high-quality hands. Existing methods typically rely on a single type of hand prior representation (e.g., pose skeleton, depth map, or 3D mesh) to guide the generation pipeline. However, most approaches overlook the fact that different representations operate on different principles, and how to utilize multiple hand priors simultaneously to further improve generation quality remains underexplored. Motivated by this, we propose a Mixture-of-Hand-Experts (MoHE) framework that repaints malformed hands within the Stable Diffusion inpainting pipeline. Our approach treats multiple ControlNets, each fine-tuned on a different hand representation, as hand experts and fuses them dynamically to achieve optimal guidance for hand generation. Given a generated image with malformed hands, we first apply hand mesh reconstruction to obtain diverse hand representations and inject them into the ControlNets as control conditions. A gating network in the MoHE module then determines the control scale of each branch for the conditioned inpainting. Extensive experiments on the HAGRID dataset demonstrate the effectiveness of our proposed method in enhancing the realism and plausibility of hands generated by diffusion models. Our code is available at https://github.com/WangYuXuan2022/Mixture-of-Hand-Experts.
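To make the fusion mechanism concrete, the following is a minimal sketch of how a gating network could weight per-expert ControlNet residuals. It is an illustrative assumption rather than the paper's exact design: the layer sizes, the pooled-feature input, and the names `MoHEGate` and `fuse_expert_residuals` are hypothetical.

```python
import torch
import torch.nn as nn


class MoHEGate(nn.Module):
    """Hypothetical gating network: predicts a control scale per hand expert.

    Layer sizes and the pooled-feature input are assumptions for
    illustration, not the paper's exact architecture.
    """

    def __init__(self, feat_dim: int, num_experts: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 2),
            nn.SiLU(),
            nn.Linear(feat_dim // 2, num_experts),
        )

    def forward(self, pooled_feat: torch.Tensor) -> torch.Tensor:
        # Softmax keeps the per-expert scales positive and summing to 1,
        # so the fused guidance stays on the same scale as one expert.
        return torch.softmax(self.mlp(pooled_feat), dim=-1)


def fuse_expert_residuals(residuals, scales):
    """Weighted sum of per-expert ControlNet residual features.

    residuals: list of E tensors, each of shape (B, C, H, W)
    scales:    gate output of shape (B, E)
    """
    stacked = torch.stack(residuals, dim=1)   # (B, E, C, H, W)
    w = scales.view(*scales.shape, 1, 1, 1)   # (B, E, 1, 1, 1)
    return (stacked * w).sum(dim=1)           # (B, C, H, W)


if __name__ == "__main__":
    # E.g. three experts: skeleton, depth map, and 3D-mesh conditions.
    B, E, C, H, W = 2, 3, 320, 64, 64
    gate = MoHEGate(feat_dim=1280, num_experts=E)
    scales = gate(torch.randn(B, 1280))
    residuals = [torch.randn(B, C, H, W) for _ in range(E)]
    fused = fuse_expert_residuals(residuals, scales)
    print(fused.shape)  # torch.Size([2, 320, 64, 64])
```

In a full pipeline, the fused residual would be added to the denoising U-Net's features in place of a single ControlNet's output, letting the gate shift guidance toward whichever hand prior is most informative for the current image.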