Abstract: Most existing text-driven face image generation and manipulation methods are based on StyleGAN2, which is inherently limited to aligned faces; consequently, these methods fail to preserve the highly variable placement of faces. Additionally, these methods directly leverage a pairwise loss to learn the correspondence between the image and text, which cannot handle complex text descriptions, e.g., text with multiple captions describing multiple facial attributes. To address these issues, we explore the feasibility of applying the more advanced StyleGAN3 to generate and manipulate face images in an open-world setting, where the target face image is not required to be aligned and the text description may contain multiple captions. To this end, we first design an improved iterative refinement strategy that adaptively predicts offsets to the generator weights, rather than residuals to the inverted latent code, via a hypernetwork, which efficiently finds a desired generator without image-specific optimization. We further analyze the disentanglement of different StyleGAN3 latent spaces and demonstrate that the ${\mathcal {S}}$ space learns a more semantically disentangled representation. To enable complex edits described by multi-caption text, we propose a cross-modal feature filtration module with a probability adaptation strategy to capture image-text correspondences. Finally, we incorporate a channel-wise attention mechanism that learns to assign importance weights to different channels, yielding a global latent manipulation direction. Extensive experiments demonstrate the superior performance of our proposed method over state-of-the-art methods.
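The sketch below illustrates the general idea behind the final component, a channel-wise attention mechanism that re-weights style channels before applying a text-derived edit direction. It is a minimal illustration, not the paper's implementation: the style dimensionality, the 512-d CLIP-like text embedding, and the MLP heads are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class ChannelWiseDirection(nn.Module):
    """Illustrative sketch: map a text embedding to a global manipulation
    direction in a style (S-like) space, with a channel-wise attention branch
    that assigns an importance weight to each style channel before the edit
    is applied. Dimensions and layer choices are assumptions."""

    def __init__(self, text_dim: int = 512, style_dim: int = 6048):
        super().__init__()
        # Predicts a raw edit direction from the text embedding.
        self.direction_head = nn.Sequential(
            nn.Linear(text_dim, 1024), nn.ReLU(),
            nn.Linear(1024, style_dim),
        )
        # Channel-wise attention: importance weight in (0, 1) per style channel.
        self.attention_head = nn.Sequential(
            nn.Linear(text_dim, 1024), nn.ReLU(),
            nn.Linear(1024, style_dim), nn.Sigmoid(),
        )

    def forward(self, text_emb: torch.Tensor, s_code: torch.Tensor,
                strength: float = 1.0) -> torch.Tensor:
        direction = self.direction_head(text_emb)        # (B, style_dim)
        channel_weights = self.attention_head(text_emb)  # (B, style_dim)
        # Only channels deemed relevant to the caption are moved strongly,
        # which keeps the edit localized to the described attributes.
        return s_code + strength * channel_weights * direction


if __name__ == "__main__":
    # Random tensors stand in for real text embeddings and StyleGAN3 style codes.
    model = ChannelWiseDirection()
    text_emb = torch.randn(4, 512)
    s_code = torch.randn(4, 6048)
    edited = model(text_emb, s_code, strength=0.8)
    print(edited.shape)  # torch.Size([4, 6048])
```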