Highlights

• A novel framework for controllable human image generation with multimodal controls is proposed. The framework's flexibility lies in its ability to incorporate different types of control signals, which can be modified to suit the requirements of other datasets or domains.

• A new architecture that integrates wavelet features into the conventional VQ-VAE is proposed to enhance image reconstruction and synthesis quality. The wavelet-based features help capture fine-grained details in the images, thereby improving the quality of the generated results (a minimal illustrative sketch of this fusion follows the highlights).

• A multimodal conditioned diffusion model is introduced to enable the generation of high-quality human images conditioned on multimodal controls. This model captures the complex dependencies between image features and control signals, allowing for a flexible and efficient image generation process.
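As a rough illustration of the second highlight, the sketch below fuses Haar wavelet high-frequency sub-bands with convolutional encoder features before vector quantization, so fine detail is preserved through downsampling. This is only a minimal sketch under assumed design choices; the names haar_dwt2 and WaveletVQEncoder are hypothetical and do not come from the paper's code.

```python
# Hypothetical sketch: fusing Haar wavelet sub-bands into a VQ-VAE-style encoder.
# Module and function names are illustrative, not the paper's implementation.
import torch
import torch.nn as nn

def haar_dwt2(x: torch.Tensor):
    """Single-level 2D Haar DWT; returns the low-pass band and the
    concatenated high-pass (detail) bands at half the input resolution."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, torch.cat([lh, hl, hh], dim=1)

class WaveletVQEncoder(nn.Module):
    """Concatenates wavelet detail bands with convolutional features before
    quantization, so high-frequency image detail reaches the codebook."""
    def __init__(self, in_ch: int = 3, hidden: int = 128, embed_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
        )
        # 3 detail bands, each with in_ch channels, at the same 1/2 resolution.
        self.fuse = nn.Conv2d(hidden + 3 * in_ch, embed_dim, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.conv(x)          # (B, hidden, H/2, W/2)
        _, detail = haar_dwt2(x)      # (B, 3*in_ch, H/2, W/2)
        return self.fuse(torch.cat([feats, detail], dim=1))  # pre-quantization features

if __name__ == "__main__":
    z = WaveletVQEncoder()(torch.randn(2, 3, 64, 64))
    print(z.shape)  # torch.Size([2, 64, 32, 32])
```

In an actual VQ-VAE, the output of this encoder would be passed to a vector-quantization layer and a decoder; the sketch only shows where the wavelet features could enter the pipeline.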