Abstract: In human-centric content generation, pre-trained text-to-image models struggle to produce the portrait images users want: images that retain the identity of an individual while exhibiting diverse expressions. This paper presents our efforts toward personalized face generation. To this end, we propose a novel multi-modal face generation framework capable of simultaneous identity-expression control and more fine-grained expression synthesis. Our expression control is sophisticated enough to be specified with a fine-grained emotional vocabulary. We devise a novel diffusion model that performs face swapping and reenactment simultaneously. Because identity and expression are entangled, controlling them separately and precisely within one framework is a nontrivial task and has therefore not been explored before. To overcome this, we propose several innovative designs in the conditional diffusion model, including a balanced identity and expression encoder, improved midpoint sampling, and explicit background conditioning. Extensive experiments demonstrate the controllability and scalability of the proposed framework in comparison with state-of-the-art text-to-image, face swapping, and face reenactment methods.
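For readers unfamiliar with the term, the following is a minimal sketch of what generic midpoint (second-order Runge-Kutta) sampling of a diffusion probability-flow ODE looks like; it is an illustration only, not the paper's "improved midpoint sampling", and the velocity function `v` is a hypothetical stand-in for a trained, conditioned denoiser.

```python
# Generic midpoint-vs-Euler integration sketch for a diffusion ODE (illustrative only).
import numpy as np

def v(x, t):
    # Placeholder velocity field; a real sampler would query the conditional
    # diffusion model (e.g., conditioned on identity/expression) here.
    return -x / max(t, 1e-5)

def euler_step(x, t, dt):
    # First-order update: one model evaluation per step.
    return x + dt * v(x, t)

def midpoint_step(x, t, dt):
    # Second-order update: evaluate the slope at the half step, then take the
    # full step with that slope, reducing discretization error per step.
    x_mid = x + 0.5 * dt * v(x, t)
    return x + dt * v(x_mid, t + 0.5 * dt)

# Integrate from t=1 (noise) toward t=0 (sample) with a few large steps.
x = np.random.randn(4)
t, dt = 1.0, -0.25
for _ in range(4):
    x = midpoint_step(x, t, dt)
    t += dt
print(x)
```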