Title: VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time

Abstract: We introduce VASA, a framework for generating lifelike talking faces with appealing visual affective skills (VAS) given a single static image and a speech audio clip. Our premiere model, VASA-1, is capable of not only generating lip movements that are exquisitely synchronized with the audio, but also producing a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness. The core innovations include a diffusion-based holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos. Through extensive experiments including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively. Our method delivers high video quality with realistic facial and head dynamics and also supports the online generation of 512×512 videos at up to 40 FPS with negligible starting latency. It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors.

Section: Introduction
In the realm of multimedia and communication, the human face is not just a visage but a dynamic canvas, where every subtle movement and expression can articulate emotions, convey unspoken messages, and foster empathetic connections. The emergence of AI-generated talking faces offers a window into a future where technology amplifies the richness of human-human and human-AI interactions. Such technology holds the promise of enriching digital communication [64,35], increasing accessibility for those with communicative impairments [29,1], transforming education methods with interactive AI tutoring [8,31], and providing therapeutic support and social interaction in healthcare [41,33].
As one step towards achieving such capabilities, our work introduces VASA-1, a new method that can produce audio-generated talking faces with a high level of realism and liveliness. Given a static face image of an arbitrary individual, alongside a speech audio clip from any person, our approach is capable of generating a hyper-realistic talking face video efficiently. This video not only features lip movements that are meticulously synchronized with the audio input but also exhibits a wide range of natural, human-like facial dynamics and head movements.
Figure 1: Given a single portrait image, a speech audio clip, and optionally a set of other control signals, our approach produces a high-quality lifelike talking face video of 512×512 resolution at up to 40 FPS. The method is generic and robust, and the generated talking faces can faithfully mimic human facial expressions and head movements, reaching a high level of realism and liveliness. (All the photorealistic portrait images in this paper are virtual, non-existing identities generated by [30,5]. See our project page for the generated video samples with audios.)
Creating talking faces from audio has attracted significant attention in recent years with numerous approaches proposed [77,39,75,51,25,62,63,61,71,74,36,26]. However, existing techniques are still far from achieving the authenticity of natural talking faces. Current research has predominantly focused on the precision of lip synchronization with promising accuracy obtained [39,61]. The creation of expressive facial dynamics and the subtle nuances of lifelike facial behavior remain largely neglected. This results in generated faces that seem rigid and unconvincing. Additionally, natural head movements also play a vital role in enhancing the perception of realism. Although recent studies have attempted to simulate realistic head motions [62,71,74], there remains a sizable gap between the generated animations and the genuine human movement patterns.
Another important factor is the efficiency of generation, which plays a pivotal role in real-time applications such as live communication. While image and video diffusion techniques have brought remarkable advancements in talking face generation [20,49,55] as well as the broader video generation field [6,9], the substantial computation demands have limited their practicality for interactive systems. A critical need exists for optimized algorithms that can bridge the gap between high-quality video synthesis and the low-latency requirements of real-time applications.
Given the limitations of existing methods, this work develops an efficient yet powerful audio-conditioned generative model that works in the latent space of head and facial movements. Different from prior works, we train a Diffusion Transformer model on the latent space of holistic facial dynamics as well as head movements. We consider all possible facial dynamics, including lip motion, (non-lip) expression, eye gaze and blinking, among others, as a single latent variable and model its probabilistic distribution in a unified manner. By contrast, existing methods often apply separate models for different factors, even with interleaved regressive and generative formulations for them [62,76,71,60,74]. Our holistic facial dynamics modeling, together with the jointly learned head motion patterns, leads to the generation of a diverse array of lifelike and emotive talking behaviors. Furthermore, we incorporate a set of optional conditioning signals such as main gaze direction, head distance, and emotion offset into the learning process. This makes the generative modeling of the complex distribution more tractable and increases the generation controllability.
To achieve our goal, another challenge lies in constructing the latent space for the aforementioned holistic facial dynamics and gathering the data for the diffusion model training. Beyond facial and head movements, a human face image contains other factors such as identity and appearance. In this work, we seek to build a proper latent space for the human face using a large volume of face videos. Our aim is for the face latent space to possess both a total state of disentanglement between facial dynamics and other factors and a high degree of expressiveness to model rich facial appearance details and dynamic nuances. We base our method on the 3D-aided representation [64,19], which was proven to be expressive, and equip it with a collection of newly designed loss functions critical to effective disentanglement. Without these new designs, we could not reach a high quality of talking face generation, especially the liveliness with nuanced emotions. Trained on face videos in a self-supervised or weakly supervised manner, our encoder can produce well-disentangled factors including 3D appearance, identity, head pose and holistic facial dynamics, and the decoder can generate high-quality faces following the given latent codes.
VASA-1 has collectively advanced the realism of lip-audio synchronization, facial dynamics, and head movement to new heights. Coupled with high image generation quality and efficient running speed, we achieved real-time talking faces that are realistic and lifelike. Through detailed evaluations, we show that our method significantly outperforms existing methods on a set of metrics, including a novel data-driven metric called Contrastive Audio and Pose Pretraining (CAPP) for measuring the audio-pose alignment and a pose variation intensity score that is related to the vividness of head motion. We believe VASA-1 brings us closer to a future where digital AI avatars can engage with us in ways that are as natural and intuitive as interactions with real humans, demonstrating appealing visual affective skills for more dynamic and empathetic information exchange.

Section: Related Work
Disentangled face representation learning. The representation of facial images through disentangled variables has been extensively studied by previous works. Some methods utilize sparse keypoints [44,72] or 3D face models [42,22,73] to explicitly characterize facial dynamics and other properties, but these can suffer from issues such as inaccurate reconstructions or limited expressive capabilities. There are also many works dedicated to learning disentangled representations within a latent space. A common approach involves separating faces into identity and non-identity components, then recombining them across different frames, either in a 2D [11,76,34,70,37,60,54] or 3D context [64,19,18]. The main challenge faced by these methods is the effective disentanglement of various factors while still achieving expressive representations of all static and dynamic facial attributes, which is addressed in this work.
Audio-driven talking face generation. Talking face video generation from audio inputs has been a long-standing task in computer vision and graphics. Early works focused on synthesizing only the lips, achieved by mapping audio signals directly to lip movements while leaving other facial attributes unchanged [53,12,39,70,13]. More recent efforts have expanded the scope to include a broader array of facial expressions and head movements derived from audio inputs. For instance, the method of [74] separates the generation targets into different categories, including lip-only 3DMM coefficients, eye blinks, and head poses. [71] proposed to decompose lip and non-lip features on top of the expression latent from [76]. Both [74] and [71] regress lip-related representations directly from audio features and model other attributes in a probabilistic manner. In contrast to these approaches, our method generates comprehensive facial dynamics and head poses from audio along with other control signals. This approach differs from the trend of further disentanglement, seeking instead to create more holistic and integrated outputs.
Video generation. Recent advances in generative models [10,27,48,47] have led to significant progress in video generation. Earlier video generation approaches [59,56,46] employed the adversarial learning [24] framework, while more recent methods [69,7,23,32,4,9] have leveraged diffusion or auto-regressive models to capture diverse video distributions. Recently, several works concurrent to ours [55,65] have adapted video diffusion techniques to audio-driven talking face generation, achieving promising results despite slow training and inference speeds. In contrast, our method not only generates high-quality results but also achieves real-time efficiency, which is crucial to latency-sensitive applications such as live communication.

Section: Method
Overall framework. As illustrated in Fig. 1, our method takes a single face image, optional control signals, and a speech audio clip to produce a realistic talking face video. Instead of generating video frames directly, we generate holistic facial dynamics and head motion in the latent space conditioned on audio and other signals. To achieve this, we start by constructing a face latent space and training the face encoder and decoder. An expressive and disentangled face latent learning framework is crafted and trained on real-life face videos. Then we train a simple yet powerful Diffusion Transformer to model the motion distribution and generate the motion latent codes at test time given audio and other conditions.

Section: Expressive and Disentangled Face Latent Space Construction
Given a corpus of unlabeled talking face videos, we aim to build a latent space for human face with high degrees of disentanglement and expressiveness. The disentanglement enables effective generative modeling of the human head and holistic facial behaviors on massive videos, irrespective of the subject identities. It also enables disentangled factor control of the output which is desirable in many applications. Existing methods fall short of either expressiveness [11,42,71,60] or disentanglement [64,19,73] or both. The expressiveness of facial appearance and dynamic movements, on the other hand, ensures that the decoder can output high quality videos with rich facial details and the latent generator is able to capture nuanced facial dynamics.
To achieve this, we base our model on the 3D-aided face reenactment framework from [64,19]. The 3D appearance feature volume can better characterize the appearance details in 3D compared to 2D feature maps. The explicit 3D feature warping is also powerful in modeling 3D head and facial movements. Specifically, we decompose a facial image into a canonical 3D appearance volume $V^{app}$, an identity code $z^{id}$, a 3D head pose $z^{pose}$, and a facial dynamics code $z^{dyn}$. Each of them is extracted from a face image by an independent encoder, except that $V^{app}$ is constructed by first extracting a posed 3D volume followed by rigid and non-rigid 3D warping to the canonical volume, as done in [19]. A single decoder $D$ takes these latent variables as input and reconstructs the face image, where similar warping fields in the inverse direction are first applied to $V^{app}$ to get the posed appearance volume. Readers are referred to [19] for more details of this architecture.
To learn the disentangled latent space, the core idea is to construct image reconstruction losses by swapping latent variables between different images in videos. Our basic loss functions are adapted from [19]. However, we found that the original losses yield poor disentanglement between facial dynamics and head pose. The disentanglement between identity and motions is also imperfect. Therefore, we introduce several additional losses crucial to achieving our goal. Inspired by [37], we add a pairwise head pose and facial dynamics transfer loss to improve their disentanglement. Let $I_i$ and $I_j$ be two frames randomly sampled from the same video. We extract their latent variables using the encoders, and transfer $I_i$'s head pose onto $I_j$ as $\hat{I}_{j, z^{pose}_i} = D(V^{app}_j, z^{id}_j, z^{pose}_i, z^{dyn}_j)$ and $I_j$'s facial motion onto $I_i$ as $\hat{I}_{i, z^{dyn}_j} = D(V^{app}_i, z^{id}_i, z^{pose}_i, z^{dyn}_j)$. Both results should depict the same subject with the pose of frame $i$ and the dynamics of frame $j$, so the discrepancy loss $l_{consist}$ between $\hat{I}_{j, z^{pose}_i}$ and $\hat{I}_{i, z^{dyn}_j}$ is subsequently minimized. To reinforce the disentanglement between identity and motions, we add a face identity similarity loss $l_{cross\_id}$ for the cross-identity pose and facial motion transfer results. Let $I_s$ and $I_d$ be the video frames of two different subjects. We can transfer the motions of $I_d$ onto $I_s$ and obtain $\hat{I}_{s, z^{pose}_d, z^{dyn}_d} = D(V^{app}_s, z^{id}_s, z^{pose}_d, z^{dyn}_d)$. Then, a cosine similarity loss between the deep face identity features [16] extracted from $I_s$ and $\hat{I}_{s, z^{pose}_d, z^{dyn}_d}$ is applied. As we will show in the experiments, our new loss function designs are crucial to achieving effective factor disentanglement and facilitating high-quality, lifelike talking face generation.
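The two added losses can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `FaceLatents` container, the `decode` and `id_features` callables, the mean absolute error used as the discrepancy measure, and the `1 - cosine` form of the identity loss are all hypothetical choices made for the sketch. The text above only states that a discrepancy is minimized and that a cosine similarity loss is applied.

```python
from dataclasses import dataclass, replace
from math import sqrt
from typing import Callable, Sequence

Vector = Sequence[float]


@dataclass(frozen=True)
class FaceLatents:
    """Hypothetical container for the four factors of one frame."""
    app: Vector    # canonical 3D appearance volume (flattened for the sketch)
    ident: Vector  # identity code
    pose: Vector   # head pose
    dyn: Vector    # holistic facial dynamics code


Decoder = Callable[[FaceLatents], Vector]


def mean_abs_error(a: Vector, b: Vector) -> float:
    # Assumed discrepancy measure; the source does not name one.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def consistency_loss(decode: Decoder, lat_i: FaceLatents, lat_j: FaceLatents) -> float:
    """l_consist for two frames i and j sampled from the same video."""
    # Frame j with frame i's head pose: D(V_j, id_j, pose_i, dyn_j).
    pose_transferred = decode(replace(lat_j, pose=lat_i.pose))
    # Frame i with frame j's facial dynamics: D(V_i, id_i, pose_i, dyn_j).
    dyn_transferred = decode(replace(lat_i, dyn=lat_j.dyn))
    return mean_abs_error(pose_transferred, dyn_transferred)


def cross_identity_loss(
    decode: Decoder,
    id_features: Callable[[Vector], Vector],
    lat_s: FaceLatents,
    lat_d: FaceLatents,
    image_s: Vector,
) -> float:
    """l_cross_id: subject s driven by the motions of a different subject d."""
    driven = decode(replace(lat_s, pose=lat_d.pose, dyn=lat_d.dyn))
    # Assumed form: one minus the cosine similarity of identity features.
    return 1.0 - cosine_similarity(id_features(image_s), id_features(driven))
```

Both swapped decodings in `consistency_loss` share the pose of frame i and the dynamics of frame j, so any remaining difference between them can only come from pose or dynamics information leaking into the wrong code.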

Section: Holistic Facial Dynamics Generation with Diffusion Transformer
Given the constructed face latent space and trained encoders, we can extract the facial dynamics and head movements from real-life talking face videos and train a generative model. Crucially, we consider identity-agnostic holistic facial dynamics generation (HFDG), where our learned latent codes represent all facial movements such as lip motion, (non-lip) expression, and eye gaze and blinking. This is in contrast to existing methods that apply separate models for different factors with interleaved regression and generative formulations [62,76,71,60,74]. Furthermore, previous methods often train on a limited number of identities [74,68,21] and cannot model the wide range of motion patterns of different humans, especially given an expressive motion latent space.
In this work, we utilize diffusion models for audio-conditioned HFDG and train on massive talking face videos from a large number of identities. In particular, we apply a transformer architecture [58,38,52] for our sequence generation task. Figure 2 shows an overview of our HFDG framework.
Formally, a motion sequence extracted from a video clip is defined as
$X = \{[z^{pose}_i, z^{dyn}_i]\}$, $i = 1, \ldots, W$.
Given its accompanying audio clip $a$, we extract the synchronized audio features $A = \{f^{audio}_i\}$, for which we use a pretrained feature extractor Wav2Vec2 [3].
Diffusion formulation. Diffusion models define two Markov chains [27,47,48]: the forward chain progressively adds Gaussian noise to the target data, while the reverse chain iteratively restores the raw signal from noise. Following the denoising score matching objective [48], we define the simplified loss function as
$$\mathbb{E}_{t \sim \mathcal{U}[1,T],\; X^0, C \sim q(X^0, C)} \left( \left\| X^0 - H(X^t, t, C) \right\|^2 \right), \tag{1}$$
where $t$ denotes the time step, $X^0 = X$ is the raw motion latent sequence, and $X^t$ is the noisy input generated by the diffusion forward process $q(X^t \mid X^{t-1}) = \mathcal{N}(X^t; \sqrt{1 - \beta_t}\, X^{t-1}, \beta_t I)$. $H$ is our transformer network, which predicts the raw signal itself instead of noise. $C$ is the condition signal, to be described next.
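A single Monte Carlo sample of the objective in Eq. (1) can be sketched as below. This is an illustrative sketch, not the authors' training code. It relies on the standard closed form $q(X^t \mid X^0) = \mathcal{N}(\sqrt{\bar{\alpha}_t} X^0, (1 - \bar{\alpha}_t) I)$ with $\bar{\alpha}_t = \prod_{s \le t} (1 - \beta_s)$, which follows from composing the Gaussian forward steps. The linear noise schedule, the value of `T`, and the `denoiser` callable standing in for $H$ are assumptions; the source does not specify them.

```python
import math
import random
from typing import Callable, List

Motion = List[List[float]]  # W frames, each a flat [z_pose, z_dyn] vector


def linear_betas(T: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Assumed linear noise schedule; the source does not state one."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * k for k in range(T)]


def alpha_bars(betas: List[float]) -> List[float]:
    """Cumulative products prod_{s<=t} (1 - beta_s) for t = 1..T."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def q_sample(x0: Motion, t: int, abar: List[float], rng: random.Random) -> Motion:
    """Draw X^t from q(X^t | X^0) in closed form; t is 1-indexed."""
    a = abar[t - 1]
    scale, sigma = math.sqrt(a), math.sqrt(1.0 - a)
    return [[scale * v + sigma * rng.gauss(0.0, 1.0) for v in frame] for frame in x0]


def x0_prediction_loss(
    denoiser: Callable[[Motion, int, object], Motion],
    x0: Motion,
    cond: object,
    abar: List[float],
    rng: random.Random,
) -> float:
    """One sample of Eq. (1): squared error between X^0 and H(X^t, t, C)."""
    t = rng.randint(1, len(abar))  # t ~ U[1, T], inclusive on both ends
    xt = q_sample(x0, t, abar, rng)
    pred = denoiser(xt, t, cond)   # the network predicts the clean signal
    return sum(
        (a - b) ** 2
        for frame_true, frame_pred in zip(x0, pred)
        for a, b in zip(frame_true, frame_pred)
    )
```

Because the network predicts $X^0$ directly rather than the noise, the loss is measured in the motion latent space itself, and a denoiser that returned the clean sequence exactly would score zero.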
Conditioning signals. The primary condition signal for our audio-driven motion generation task is the audio feature sequence A. We also incorporate several additional signals, which not only make the generative modeling more tractable but also increase the generation controllability. Specifically, we consider the main eye gaze direction g, head-to-camera distance d, and emotion offset e. The main gaze direction, g = (θ, ϕ), is defined by a vector in spherical coordinates. It specifies the focused direction of the generated talking face. We extract g for the training video clips using [2] on each frame followed by a simple histogram-based clustering algorithm. The head distance d is a normalized scalar controlling the distance between the face and the virtual camera, which affects the face scale in the generated face video. We obtain this scale label for the training videos using [17]. The emotion offset e modulates the depicted emotion on the talking face. Note that emotion is often intrinsically linked to and can be largely inferred from audio; hence, e serves only as a global offset added to enhance or moderately alter the emotion when required. It is not designed to achieve a total emotion shift during inference or produce emotions incongruent with the input audio. In practice, we use the averaged emotion coefficients extracted by [43] as our emotion signal.
In order to achieve a seamless transition between adjacent windows, we incorporate the last $K$ frames of the audio features and generated motions from the previous window as the condition of the current one. To summarize, our input condition can be denoted as $C = [X^{pre}, A^{pre}; A, g, d, e]$. All conditions are concatenated with noise along the temporal dimension as the input to the transformer.
Classifier-free guidance (CFG) [28]. In the training stage, we randomly drop each of the input conditions. During inference, we apply
$$\hat{X}^0 = \Big(1 + \sum_{c \in C} \lambda_c\Big) \cdot H(X^t, t, C) - \sum_{c \in C} \lambda_c \cdot H(X^t, t, C|_{c=\emptyset}), \tag{2}$$
where $\lambda_c$ is the CFG scale for condition $c$, and $C|_{c=\emptyset}$ denotes that the condition $c$ is replaced with $\emptyset$.
During training, we use a drop probability of 0.1 for each condition except for $X^{pre}$ and $A^{pre}$, for which we use 0.5. This ensures the model can handle the first window, which has no preceding audio and motions (i.e., they are set to $\emptyset$). We also randomly drop the last few frames of $A$ to ensure robust motion generation for audio sequences shorter than the window length.
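The condition dropout and the guided prediction of Eq. (2) can be sketched as follows. This is a hedged illustration only: the dictionary-based condition container, the key names, the use of `None` for the empty condition $\emptyset$, and the independent dropping of each condition (including the two preceding-window conditions) are assumptions of the sketch. Only the drop probabilities and the form of Eq. (2) come from the text.

```python
import random
from typing import Callable, Dict, List, Optional

Motion = List[List[float]]
Conditions = Dict[str, Optional[object]]  # None stands for the empty condition

# Drop probabilities stated in the text; the key names are hypothetical.
DROP_PROB = {"x_pre": 0.5, "a_pre": 0.5, "audio": 0.1, "gaze": 0.1,
             "distance": 0.1, "emotion": 0.1}


def drop_conditions(cond: Conditions, drop_prob: Dict[str, float],
                    rng: random.Random) -> Conditions:
    """Training-time dropout: each condition is independently emptied."""
    return {
        name: (None if rng.random() < drop_prob.get(name, 0.0) else value)
        for name, value in cond.items()
    }


def cfg_predict(
    denoiser: Callable[[Motion, int, Conditions], Motion],
    xt: Motion,
    t: int,
    cond: Conditions,
    scales: Dict[str, float],
) -> Motion:
    """Guided clean-signal estimate following Eq. (2)."""
    full = denoiser(xt, t, cond)
    total = sum(scales.values())
    # First term: (1 + sum of scales) times the fully conditioned prediction.
    out = [[(1.0 + total) * v for v in frame] for frame in full]
    for name, lam in scales.items():
        if lam == 0.0:
            continue  # a zero scale contributes nothing; skip the extra pass
        ablated = dict(cond)
        ablated[name] = None  # C with condition c replaced by the empty set
        partial = denoiser(xt, t, ablated)
        for frame_out, frame_part in zip(out, partial):
            for k, v in enumerate(frame_part):
                frame_out[k] -= lam * v
    return out
```

With the default scales reported later in the paper (audio 0.5, main gaze 1.0), this costs three network evaluations per sampling step: one with all conditions and one for each guided condition.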

Section: Talking Face Video Generation
At inference time, given an arbitrary face image and an audio clip, we first extract the 3D appearance volume $V^{app}$ and identity code $z^{id}$ using our trained face encoders. Then, we extract the audio features, split them into segments of length $W$, and generate the head and facial motion sequences $\{X = \{[z^{pose}_i, z^{dyn}_i]\}\}$ one by one in a sliding-window manner using our trained diffusion transformer $H$. The final video can be generated subsequently using our trained decoder.
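The sliding-window procedure can be sketched as a simple loop. This is a minimal sketch under assumptions: `sample_window` is a hypothetical callable that runs the full diffusion sampling for one window and returns one motion frame per audio frame, and the windows are taken as non-overlapping segments of length `W`, with the last `K` frames of the previous window passed along as context.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

A = TypeVar("A")  # one frame of audio features
M = TypeVar("M")  # one frame of motion latents [z_pose, z_dyn]


def tail(seq: Sequence, k: int) -> list:
    """Last k items of seq (fewer if seq is shorter, none if k is 0)."""
    return list(seq[max(0, len(seq) - k):])


def generate_motion(
    sample_window: Callable[[List[A], Optional[List[M]], Optional[List[A]]], List[M]],
    audio_feats: Sequence[A],
    W: int,
    K: int,
) -> List[M]:
    """Generate motion latents window by window with K frames of context."""
    motion: List[M] = []
    x_pre: Optional[List[M]] = None  # first window has no preceding motion
    a_pre: Optional[List[A]] = None  # ... and no preceding audio
    for start in range(0, len(audio_feats), W):
        audio_win = list(audio_feats[start:start + W])  # last one may be short
        frames = sample_window(audio_win, x_pre, a_pre)
        motion.extend(frames)
        # Context for the next window: the last K generated and audio frames.
        x_pre, a_pre = tail(frames, K), tail(audio_win, K)
    return motion
```

The empty context on the first window and the possibly shorter final window correspond to the two cases the training-time dropout is described as covering.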

Section: Experiments
Implementation details. For face latent space learning, we use the public VoxCeleb2 dataset from [14], which contains talking face videos from about 6K subjects. We reprocess the dataset and discard the clips with multiple individuals and those of low quality using the method of [50]. For motion latent generation, we use an 8-layer transformer encoder with an embedding dimension of 512 and 8 attention heads as our diffusion network. The model is trained on VoxCeleb2 [14] and another high-resolution talk video dataset collected by us, which contains about 3.5K subjects. In our default setup, the model uses a forward-facing main gaze condition, the average head distance of all training videos, and an empty emotion offset condition. The CFG parameters are set to $\lambda_A = 0.5$ and $\lambda_g = 1.0$, and 50 sampling steps are used. Our face latent model takes around 7 days of training on a workstation with 4 NVIDIA RTX A6000 GPUs, and the diffusion transformer takes around 3 days. The total data used for training comprises approximately 500K clips, each lasting between 2 and 10 seconds. The parameter counts of our 3D-aided face latent model and diffusion transformer model are about 200M and 29M, respectively.
Evaluation benchmarks. We evaluate our method using two datasets. The first is a subset of VoxCeleb2 [14]. We randomly selected 46 subjects from the test split of VoxCeleb2 and randomly sampled 10 video clips for each subject, resulting in a total of 460 clips. These video clips are about 5∼15 seconds long (80% are less than 10 seconds), with most of the content being interviews and news reports. To evaluate our method under long speech generation with a wider range of vocal variations, we additionally collected 32 one-minute clips of 17 individuals. These videos are predominantly sourced from online coaching sessions and educational lectures, and the talking styles are considerably more diverse than in VoxCeleb2. We refer to this dataset as OneMin-32.
Inference speed. Our method generates video frames of 512×512 size at 45 fps in the offline batch processing mode, and can support up to 40 fps in the online streaming mode with a preceding latency of only 170 ms, evaluated on a desktop PC with a single NVIDIA RTX 4090 GPU.

Section: Quantitative Evaluation
Evaluation metrics. We use the following metrics for quantitative evaluation of our generated lip movement, head pose and overall video quality, including a new data-driven audio-pose synchronization metric trained in a way similar to CLIP [40]:
• Audio-lip synchronization. We use a pretrained audio-lip synchronization network, i.e., SyncNet [15], to assess the alignment of the input audio with the generated lip movements in videos. Specifically, we compute the confidence score and feature distance as $S_C$ and $S_D$ respectively. Higher $S_C$ and lower $S_D$ indicate better audio-lip synchronization quality in general.
• Audio-pose alignment. Measuring the alignment between the generated head poses and input audio is not trivial and there are no well-established metrics. A few recent studies [74,52] employed the Beat Align Score [45] to evaluate audio-pose alignment. However, this metric is not optimal because the concept of a "beat" in the context of natural speech and human head motion is ambiguous. In this work, we introduce a new data-driven metric called Contrastive Audio and Pose Pretraining (CAPP) score. Inspired by CLIP [40], we jointly train a pose sequence encoder and an audio sequence encoder and predict whether the input pose sequence and audio are paired. The audio encoder is initialized from a pretrained Wav2Vec2 network [3] and the pose encoder is a randomly initialized 6-layer transformer network. The input window size is 3 seconds. Our CAPP model is trained on 2K hours of real-life audio and pose sequences, and demonstrates a robust capability to assess the degree of synchronization between audio inputs and generated poses (see Sec. 4.3).
• Pose variation intensity. We further define a pose variation intensity score $\Delta P$, which is the average of the pose angle differences between adjacent frames. Averaged over all generated frames, $\Delta P$ provides an indication of the overall head motion intensity generated by a method.
• Video quality. Following previous video generation works [69,46], we use the Fréchet Video Distance (FVD) [57] to evaluate the generated video quality. We compute the FVD metric using sequences of 25 consecutive frames, at a resolution of 224×224.
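Two of the metrics above lend themselves to a short sketch: the pose variation intensity and a CLIP-style contrastive objective of the kind CAPP is described as being inspired by. Both functions are illustrative assumptions rather than the authors' definitions. The sketch takes poses as per-frame angle tuples, uses absolute differences averaged over all angles, ignores angle wrap-around, and uses a symmetric cross-entropy over cosine similarities with a hypothetical temperature of 0.07.

```python
import math
from typing import List, Sequence


def pose_variation_intensity(poses: Sequence[Sequence[float]]) -> float:
    """Mean absolute angle change between adjacent frames (needs >= 2 frames)."""
    diffs = [
        abs(b - a)
        for prev, curr in zip(poses, poses[1:])
        for a, b in zip(prev, curr)
    ]
    return sum(diffs) / len(diffs)


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(audio_emb, pose_emb) -> List[List[float]]:
    """Cosine similarity between every audio and every pose embedding."""
    audio = [_normalize(v) for v in audio_emb]
    pose = [_normalize(v) for v in pose_emb]
    return [[sum(x * y for x, y in zip(a, p)) for p in pose] for a in audio]


def _diag_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(audio_emb, pose_emb, temperature: float = 0.07) -> float:
    """Symmetric contrastive loss; item i of each batch is a true pair."""
    sim = similarity_matrix(audio_emb, pose_emb)
    logits = [[s / temperature for s in row] for row in sim]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(transposed))
```

Under this reading, a score for one audio-pose pair at evaluation time would be the corresponding cosine similarity, although the source does not spell out how the reported CAPP values are computed.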
Compared methods. We compare our method with three existing audio-driven talking face generation methods: MakeItTalk [77], Audio2Head [62], and SadTalker [74].
Main results. For each audio input, we generate a single video for deterministic approaches, i.e., MakeItTalk and Audio2Head. For SadTalker and our method, we sample three videos for each audio and average the computed metrics. Since different pose representations are used by these methods, we re-extract the head poses from the generated frames to compute the pose-related metrics (i.e., CAPP and $\Delta P$). For the FVD metric, we use 2K 25-frame video clips of both the real videos and the generated ones. For reference, we also report the evaluated metrics of real videos.
Table 1 presents the results on the VoxCeleb2 and OneMin-32 benchmarks. Note that we did not evaluate the FVD on VoxCeleb2 as its video quality is varied and often low. On both benchmarks, our method achieves the best results among all methods on all evaluated metrics. In terms of audio-lip synchronization scores ($S_C$ and $S_D$), our method outperforms all others by a wide margin. Note that our method yields better scores than real videos, which is due to the effect of the audio CFG (see Sec. 4.3). Our generated poses are better aligned with the audio, especially on the OneMin-32 benchmark, as reflected by the CAPP scores. The head movements also exhibit the highest intensity according to $\Delta P$, although there is still a gap to the intensity of real videos. Our FVD score is significantly lower than the others, demonstrating the much higher video quality and realism of our results.

Section: Qualitative Evaluation
Visual results. Figure 1 presents some representative audio-driven talking face generation results of our method. Visually inspected, our method can generate high-quality video frames with vivid facial emotions. Moreover, it can generate human-like conversational behaviors, including sporadic shifts in eye gaze during speech and contemplation, as well as the natural and variable rhythm of eye blinking, among other nuances. We highly recommend that readers view our video results in the supplementary material to fully perceive the capabilities and output quality of our method.
Generation controllability. Figure 3 shows our generated results under different control signals including main eye gaze, head distance, and emotion offset. Our model can well interpret these signals and produce talking face results that closely adhere to these specified parameters.
Disentanglement of face latents. Figure A.1 shows that when applying the same motion latent sequences onto different subjects, our method effectively maintains both the distinct facial movements and the unique facial identities. This indicates the efficacy of our method in disentangling identity and motion.  

Section: Analysis and Ablation Study
CAPP metric. We analyze the effectiveness of our proposed CAPP metric in measuring the alignment between audio and head pose. First, we study its sensitivity to temporal shifting by manually introducing frame offsets to ground-truth audio-pose pairs. We extract 3-second clip segments from the VoxCeleb2 test split, yielding approximately 2.1K audio-pose pairs. The average CAPP score for these pairs is 0.608, as shown in Table 2. Manual frame shifts lead to a rapid decline in CAPP scores, approaching zero for shifts larger than two frames. This indicates a robust correlation between CAPP scores and audio-head pose alignment. We further investigate the effect of head movement intensity on CAPP by manually scaling the pose differences between consecutive frames using various factors. Table 3 shows that altering movement intensity negatively impacts the CAPP scores, demonstrating that CAPP can assess the alignment of audio and pose in terms of their intensity. However, this sensitivity to intensity appears less pronounced than that to temporal misalignment.
Table 4: Ablation study of the audio and main gaze CFG scales as well as the sampling steps. $E_g$ denotes the average angular error of main gaze directions and $E_s$ is the average head distance error.
CFG scales for diffusion model. The CFG strategy [28] for diffusion models can attain a trade-off between sample quality and diversity. Here we evaluate the choice of the CFG scales for the audio and main gaze conditions (i.e., $\lambda_A$ and $\lambda_g$ in Eq. 2) in our model.
As shown in Table 4, as we increase the value of $\lambda_g$, the accuracy of gaze control improves. Increasing the audio CFG scale to $\lambda_A = 0.5$ significantly enhances the performance of lip-audio alignment ($S_C$ and $S_D$), pose-audio alignment (CAPP), and pose variation intensity ($\Delta P$). With positive audio CFG, the lip-audio alignment scores even surpass those evaluated on real videos (the results without audio CFG, i.e., $\lambda_A = 0$, were slightly worse than or comparable to them). Moreover, the FVD score shows a slight drop, which indicates slightly better video quality.
Further increasing $\lambda_A$ marginally improves lip-audio synchronization and reduces FVD$_{25}$, but at the cost of slightly degrading audio-pose synchronization and gaze controllability. In addition, observations from the generated videos indicate that a higher $\lambda_A$ significantly amplifies mouth movements for strong vocals and causes head pose jitter during rapid speech. For balanced performance and overall generation quality, we set $\lambda_A = 0.5$ and $\lambda_g = 1.0$ as our standard configuration.
We also evaluated the influence of sampling steps on performance. Table 4 illustrates that decreasing the steps from 50 to 10 improves audio-lip and audio-pose alignment while compromising pose variation intensity and overall video quality. This step reduction could accelerate the inference process by a factor of 5 for this latent motion generation module.
Training data scale. To validate the data scale influence and compare our model with previous methods at similar scales, we trained a diffusion model using only 10% of the data (i.e., 50K clips).
As shown in Table 1, the model trained on this reduced dataset demonstrates comparable audio-lip and audio-pose synchronization to the full-dataset model, although the FVD and $\Delta P$ metrics are not as good. Nonetheless, it still significantly outperforms previous methods across all metrics assessing synchronization, motion intensity, and video quality. This indicates that our approach remains highly effective even with much less data, and that increasing the dataset size enhances motion diversity.
Losses for latent space learning. As described in Sec. 3.1, we introduce new losses $l_{consist}$ and $l_{cross\_id}$ to improve the disentanglement of facial dynamics, head pose, and face identity. To validate the effectiveness of $l_{consist}$, we transfer only facial dynamics from the source image to the target while keeping the target's pose unchanged. Figure 4 shows that without $l_{consist}$, the latent model may struggle to replicate subtle facial dynamics such as side glances and lip asymmetries, which are oftentimes coupled with head poses (e.g., a skewed mouth may coincide with a tilted head, and the gaze direction usually aligns with the head's pose). Decoupling these subtle yet important dynamics is challenging without explicit constraints from $l_{consist}$.
We also evaluate the face identity loss $l_{cross\_id}$, which is applied to cross-identity driving during training. We use all 108 subjects from the VoxCeleb2 test set for evaluation. For each subject, we chose the image that is closest to a frontal view from the first frames of all its clips to serve as the target image. Then we randomly selected 50 clips of other subjects as source videos, which leads to a total of 5,400 cross-reenactment clips. We calculate the facial identity preservation score by averaging the facial identity feature cosine similarity over all generated frames of all subjects. With the introduced face identity loss $l_{cross\_id}$, the identity preservation score of our results improved from 0.72 to 0.80.

Section: Conclusion
In summary, our work presents an audio-driven talking face generation model that stands out for its efficient generation of realistic lip synchronization, vivid facial expressions, and naturalistic head movements from a single image and audio input. It significantly outperforms existing methods in delivering video quality and performance efficiency, demonstrating promising visual affective skills in the generated face videos. The technical cornerstone is an innovative holistic facial dynamics and head movement generation model that works in an expressive and disentangled face latent space.
The advancements made by VASA-1 have the potential to reshape human-human and human-AI interactions across various domains, including communication, education, and healthcare. The integration of controllable conditioning signals further enhances the model's adaptability for personalized user experiences.
There are still several limitations with our method. Currently, it processes human regions only up to the torso. Extending to the full upper body could offer additional capabilities. While utilizing 3D latent representations, the absence of a more explicit 3D face model such as [66,67] may result in artifacts like texture sticking due to the neural rendering. Additionally, our approach does not account for non-rigid elements like hair and clothing, which could be addressed with a stronger video prior. In the future, we also plan to incorporate more diverse talking styles and emotions to improve expressiveness and control.

Section: Contribution statement
Sicheng Xu, Guojun Chen, and Yu-Xiao Guo were the core contributors to the implementation, training, and experimentation of various algorithm modules, as well as the data processing and management. Jiaolong Yang initiated the project idea, led the project, designed the overall framework, and provided detailed technical advice on each component. Chong Li, Zhengyu Zang, and Yizhong Zhang contributed to enhancing the system quality, conducting evaluations, and demonstrating results. Xin Tong provided technical advice throughout the project and helped with project coordination. Baining Guo offered strategic research direction guidance, scientific advising, and other project support. The paper was written by Jiaolong Yang and Sicheng Xu.

Section: Claims
Question: Does the abstract and introduction clearly state the claims made in the paper?
Answer: [Yes] Justification: Yes, the claims match the paper's contributions and scope.
Guidelines:
• The answer NA means that the abstract and introduction do not include the claims made in the paper.
• The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
• The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
• It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

Section: Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes] Justification: We discuss the limitations in Section 5.
Guidelines:
• The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
• The authors are encouraged to create a separate "Limitations" section in their paper.
• The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
• The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
• The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
• The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
• If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
• While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

Section: Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [NA] Justification: This paper does not include theoretical results.
Guidelines:
• The answer NA means that the paper does not include theoretical results.
• All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
• All assumptions should be clearly stated or referenced in the statement of any theorems.
• The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
• Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
• Theorems and Lemmas that the proof relies upon should be properly referenced.

Section: Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes] Justification: We discussed the positive and negative societal impacts in Section A.
Guidelines:
• The answer NA means that there is no societal impact of the work performed.
• If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
• Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
• The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
• The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
• If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

Section: Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [Yes]
Justification: Currently we have no plan to release the model or data to avoid potential misuse. We also discussed the development of safeguards in Section A.
Guidelines:
• The answer NA means that the paper poses no such risks.
• Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
• Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
• We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

Section: Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes] Justification: We cited the paper for the model/dataset we used in our paper.
Guidelines:
• The answer NA means that the paper does not use existing assets.
• The authors should cite the original paper that produced the code package or dataset.
• The authors should state which version of the asset is used and, if possible, include a URL.
• The name of the license (e.g., CC-BY 4.0) should be included for each asset.
• For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
• If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
• For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

Section: Acknowledgments
We would like to thank our colleagues Zheng Zhang, Zhirong Wu, Shujie Liu, Dong Chen, Xu Tan, Yu Deng, Lidong Zhou, and others for the valuable discussions and insightful suggestions for our project.

Section: A Societal Impacts and Responsible AI Considerations
Our research focuses on generating audio-driven visual affective skills for virtual AI avatars, aiming for positive applications. It is not intended to create content that is used to mislead or deceive. However, like other related content generation techniques, it could still potentially be misused for impersonating humans. We are opposed to any behavior that creates misleading or harmful content depicting real persons. Currently, the videos generated by this method still contain identifiable artifacts, and the numerical study in Section 4 shows that there is still a gap to the authenticity of real videos. Furthermore, we have trained a neural-network-based detector to distinguish real videos from those generated by VASA-1, and the detector achieves 97.8% accuracy on this task.
While acknowledging the possibility of misuse, it's imperative to recognize the substantial positive potential of our technique. The benefits, ranging from enhancing educational equity and improving accessibility for individuals with communication challenges to offering companionship or therapeutic support to those in need, underscore the importance of our research and other related explorations. We are dedicated to developing AI responsibly, with the goal of advancing human well-being.
To combat potential misuse of our technique and other related ones and to provide necessary safeguards, we are also working on applying our method to advance face media forgery detection. Specifically, we are training generic face forgery detection models that incorporate our generated talking face videos as part of the training data. Our preliminary exploration shows that using our method to generate training data can clearly improve the generality of forgery detection models, and we will keep the community updated on new progress.

Section: B More Qualitative Evaluation, Comparison and Ablation Study
See

Section: Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes] Justification: Yes, the paper has provided the necessary information.
Guidelines:
• The answer NA means that the paper does not include experiments.
• If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
• If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
• Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
• While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example
(a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
(b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
(c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
(d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

Section: Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [No] Justification: Based on the RAI considerations, we will not release our code or data in case of potential misuse, as discussed in Section A.
Guidelines:
• The answer NA means that the paper does not include experiments requiring code.
• Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
• While we encourage the release of code and data, we understand that this might not be possible, so "No" is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
• The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
• The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
• The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
• At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
• Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

Section: New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA] Justification: We will not release new assets.
Guidelines:
• The answer NA means that the paper does not release new assets.
• Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
• The paper should discuss whether and how consent was obtained from people whose asset is used.
• We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
• For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.


References:
[b0] Ahmed A Abdelrahman; Thorsten Hempel; Aly Khalifa; Ayoub Al-Hamadi; Laslo Dinges (2023). L2cs-net: Fine-grained gaze estimation in unconstrained environments. IEEE
[b1] Alexei Baevski; Yuhao Zhou; Abdelrahman Mohamed; Michael Auli (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems
[b2] Omer Bar-Tal; Hila Chefer; Omer Tov; Charles Herrmann; Roni Paiss; Shiran Zada; Ariel Ephrat; Junhwa Hur; Yuanzhen Li; Tomer Michaeli (2024). Lumiere: A space-time diffusion model for video generation. 
[b3] James Betker; Gabriel Goh; Li Jing; Tim Brooks; Jianfeng Wang; Linjie Li; Long Ouyang; Juntang Zhuang; Joyce Lee; Yufei Guo (). Improving image generation with better captions. 
[b4] Andreas Blattmann; Tim Dockhorn; Sumith Kulal; Daniel Mendelevitch; Maciej Kilian; Dominik Lorenz; Yam Levi; Zion English; Vikram Voleti; Adam Letts (2023). Stable video diffusion: Scaling latent video diffusion models to large datasets. 
[b5] Andreas Blattmann; Robin Rombach; Huan Ling; Tim Dockhorn; Seung Wook Kim; Sanja Fidler; Karsten Kreis (2023). Align your latents: High-resolution video synthesis with latent diffusion models. 
[b6] Aras Bozkurt; Xiao Junhong; Sarah Lambert; Angelica Pazurek; Helen Crompton; Suzan Koseoglu; Robert Farrow; Melissa Bond; Chrissi Nerantzi; Sarah Honeychurch (2023). Speculative futures on chatgpt and generative artificial intelligence (ai): A collective reflection from the educational landscape. Asian Journal of Distance Education
[b7] Tim Brooks; Bill Peebles; Connor Holmes; Will Depue; Yufei Guo; Li Jing; David Schnurr; Joe Taylor; Troy Luhman; Eric Luhman; Clarence Ng; Ricky Wang; Aditya Ramesh (2024). Video generation models as world simulators. 
[b8] Tom Brown; Benjamin Mann; Nick Ryder; Melanie Subbiah; Jared D Kaplan; Prafulla Dhariwal; Arvind Neelakantan; Pranav Shyam; Girish Sastry; Amanda Askell (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems
[b9] Egor Burkov; Igor Pasechnik; Artur Grigorev; Victor Lempitsky (2020). Neural head reenactment with latent pose descriptors. 
[b10] Lele Chen; Zhiheng Li; Ross K Maddox; Zhiyao Duan; Chenliang Xu (2018). Lip movements generation at a glance. 
[b11] Kun Cheng; Xiaodong Cun; Yong Zhang; Menghan Xia; Fei Yin; Mingrui Zhu; Xuan Wang; Jue Wang; Nannan Wang (2022). Videoretalking: Audio-based lip synchronization for talking head video editing in the wild. 
[b12] Joon Son Chung; Arsha Nagrani; Andrew Zisserman (2018). Voxceleb2: Deep speaker recognition. 
[b13] Joon Son Chung; Andrew Zisserman (2017). Out of time: automated lip sync in the wild. Springer
[b14] Jiankang Deng; Jia Guo; Niannan Xue; Stefanos Zafeiriou (2019). Arcface: Additive angular margin loss for deep face recognition. 
[b15] Yu Deng; Jiaolong Yang; Sicheng Xu; Dong Chen; Yunde Jia; Xin Tong (2019). Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. 
[b16] Nikita Drobyshev; Antoni Bigata Casademunt; Konstantinos Vougioukas; Zoe Landgraf; Stavros Petridis; Maja Pantic (2024). Emoportraits: Emotion-enhanced multimodal one-shot head avatars. 
[b17] Nikita Drobyshev; Jenya Chelishev; Taras Khakhulin; Aleksei Ivakhnenko; Victor Lempitsky; Egor Zakharov (2022). Megaportraits: One-shot megapixel neural head avatars. 
[b18] Chenpeng Du; Qi Chen; Tianyu He; Xu Tan; Xie Chen; Kai Yu; Sheng Zhao; Jiang Bian (2023). Dae-talker: High fidelity speech-driven talking face generation with diffusion autoencoder. 
[b19] Yingruo Fan; Zhaojiang Lin; Jun Saito; Wenping Wang; Taku Komura (2022). Faceformer: Speech-driven 3d facial animation with transformers. 
[b20] Yue Gao; Yuan Zhou; Jinglu Wang; Xiao Li; Xiang Ming; Yan Lu (2023). High-fidelity and freely controllable talking head video generation. 
[b21] Rohit Girdhar; Mannat Singh; Andrew Brown; Quentin Duval; Samaneh Azadi; Sai Saketh Rambhatla; Akbar Shah; Xi Yin; Devi Parikh; Ishan Misra (2023). Emu video: Factorizing text-to-video generation by explicit image conditioning. 
[b22] Ian Goodfellow; Jean Pouget-Abadie; Mehdi Mirza; Bing Xu; David Warde-Farley; Sherjil Ozair; Aaron Courville; Yoshua Bengio (2014). Generative adversarial nets. Advances in Neural Information Processing Systems
[b23] Yudong Guo; Keyu Chen; Sen Liang; Yong-Jin Liu; Hujun Bao; Juyong Zhang (2021). Ad-nerf: Audio driven neural radiance fields for talking head synthesis. 
[b24] Tianyu He; Junliang Guo; Runyi Yu; Yuchi Wang; Jialiang Zhu; Kaikai An; Leyi Li; Xu Tan; Chunyu Wang; Han Hu (2024). Gaia: Zero-shot talking avatar generation. 
[b25] Jonathan Ho; Ajay Jain; Pieter Abbeel (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems
[b26] Jonathan Ho; Tim Salimans (2022). Classifier-free diffusion guidance. 
[b27] Esperanza Johnson; Ramón Hervás; Carlos Gutiérrez López De La Franca; Tania Mondéjar; Sergio F Ochoa; Jesús Favela (2018). Assessing empathy and managing emotions through interactions with an affective avatar. Health informatics journal
[b28] Tero Karras; Samuli Laine; Miika Aittala; Janne Hellsten; Jaakko Lehtinen; Timo Aila (2020). Analyzing and improving the image quality of stylegan. 
[b29] Greg Kessler (2018). Technology and the future of language teaching. Foreign Language Annals
[b30] Dan Kondratyuk; Lijun Yu; Xiuye Gu; José Lezama; Jonathan Huang; Rachel Hornung; Hartwig Adam; Hassan Akbari; Yair Alon; Vighnesh Birodkar (2023). Videopoet: A large language model for zero-shot video generation. 
[b31] Julian Leff; Geoffrey Williams; Mark Huckvale; Maurice Arbuthnot; Alex P Leff (2014). Avatar therapy for persecutory auditory hallucinations: What is it and how does it work?. Psychosis
[b32] Borong Liang; Yan Pan; Zhizhi Guo; Hang Zhou; Zhibin Hong; Xiaoguang Han; Junyu Han; Jingtuo Liu; Errui Ding; Jingdong Wang (2022). Expressive talking head generation with granular audio-visual control. 
[b33] Shugao Ma; Tomas Simon; Jason Saragih; Dawei Wang; Yuecheng Li; Fernando De la Torre; Yaser Sheikh (2021). Pixel codec avatars. 
[b34] Yifeng Ma; Suzhen Wang; Zhipeng Hu; Changjie Fan; Tangjie Lv; Yu Ding; Zhidong Deng; Xin Yu (2023). Styletalk: One-shot talking head generation with controllable speaking styles. 
[b35] Youxin Pang; Yong Zhang; Weize Quan; Yanbo Fan; Xiaodong Cun; Ying Shan; Dong-Ming Yan (2023). Dpe: Disentanglement of pose and expression for general video portrait editing. 
[b36] William Peebles; Saining Xie (2023). Scalable diffusion models with transformers. 
[b37] K R Prajwal; Rudrabha Mukhopadhyay; Vinay P Namboodiri; C V Jawahar (2020). A lip sync expert is all you need for speech to lip generation in the wild. 
[b38] Alec Radford; Jong Wook Kim; Chris Hallacy; Aditya Ramesh; Gabriel Goh; Sandhini Agarwal; Girish Sastry; Amanda Askell; Pamela Mishkin; Jack Clark (2021). Learning transferable visual models from natural language supervision. PMLR
[b39] Imogen C Rehm; Emily Foenander; Klaire Wallace; Jo-Anne M Abbott; Michael Kyrios; Neil Thomas (2016). What role can avatars play in e-mental health interventions? exploring new models of client-therapist interaction. Frontiers in Psychiatry
[b40] Yurui Ren; Ge Li; Yuanqi Chen; Thomas H Li; Shan Liu (2021). PIRenderer: Controllable portrait image generation via semantic neural rendering. 
[b41] Andrey V Savchenko (2022). Hsemotion: High-speed emotion recognition library. Software Impacts
[b42] Aliaksandr Siarohin; Stéphane Lathuilière; Sergey Tulyakov; Elisa Ricci; Nicu Sebe (2019). First order motion model for image animation. 
[b43] Li Siyao; Weijiang Yu; Tianpei Gu; Chunze Lin; Quan Wang; Chen Qian; Chen Change Loy; Ziwei Liu (2022). Bailando: 3d dance generation by actor-critic gpt with choreographic memory. 
[b44] Ivan Skorokhodov; Sergey Tulyakov; Mohamed Elhoseiny (2022). Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. 
[b45] Jiaming Song; Chenlin Meng; Stefano Ermon (2020). Denoising diffusion implicit models. 
[b46] Yang Song; Jascha Sohl-Dickstein; Diederik P Kingma; Abhishek Kumar; Stefano Ermon; Ben Poole (2020). Score-based generative modeling through stochastic differential equations. 
[b47] Michał Stypułkowski; Konstantinos Vougioukas; Sen He; Maciej Zięba; Stavros Petridis; Maja Pantic (2024). Diffused heads: Diffusion models beat gans on talking-face generation. 
[b48] Shaolin Su; Qingsen Yan; Yu Zhu; Cheng Zhang; Xin Ge; Jinqiu Sun; Yanning Zhang (2020). Blindly assess image quality in the wild guided by a self-adaptive hyper network. 
[b49] Yasheng Sun; Hang Zhou; Ziwei Liu; Hideki Koike (2021). Speech2talking-face: Inferring and driving a face with synchronized audio-visual representation. 
[b50] Zhiyao Sun; Tian Lv; Sheng Ye; Matthieu Gaetan Lin; Jenny Sheng; Yu-Hui Wen; Minjing Yu; Yong-Jin Liu (2023). Diffposetalk: Speech-driven stylistic 3d facial animation and head pose generation via diffusion models. 
[b51] Supasorn Suwajanakorn; Steven M Seitz; Ira Kemelmacher-Shlizerman (2017). Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics
[b52] Shuai Tan; Bin Ji; Mengxiao Bi; Ye Pan (2024). Edtalk: Efficient disentanglement for emotional talking head synthesis. 
[b53] Linrui Tian; Qi Wang; Bang Zhang; Liefeng Bo (2024). Emo: Emote portrait alive-generating expressive portrait videos with audio2video diffusion model under weak conditions. 
[b54] Sergey Tulyakov; Ming-Yu Liu; Xiaodong Yang; Jan Kautz (2018). Mocogan: Decomposing motion and content for video generation. 
[b55] Thomas Unterthiner; Sjoerd van Steenkiste; Karol Kurach; Raphaël Marinier; Marcin Michalski; Sylvain Gelly (2019). Fvd: A new metric for video generation. 
[b56] Ashish Vaswani; Noam Shazeer; Niki Parmar; Jakob Uszkoreit; Llion Jones; Aidan N Gomez; Łukasz Kaiser; Illia Polosukhin (2017). Attention is all you need. Advances in Neural Information Processing Systems
[b57] Carl Vondrick; Hamed Pirsiavash; Antonio Torralba (2016). Generating videos with scene dynamics. Advances in Neural Information Processing Systems
[b58] Duomin Wang; Yu Deng; Zixin Yin; Heung-Yeung Shum; Baoyuan Wang (2023). Progressive disentangled representation learning for fine-grained controllable talking head synthesis. 
[b59] Jiayu Wang; Kang Zhao; Shiwei Zhang; Yingya Zhang; Yujun Shen; Deli Zhao; Jingren Zhou (2023). Lipformer: High-fidelity and generalizable talking face generation with a pre-learned facial codebook. 
[b60] Suzhen Wang; Lincheng Li; Yu Ding; Changjie Fan; Xin Yu (2021). Audio2head: Audio-driven oneshot talking-head generation with natural head motion. 
[b61] Suzhen Wang; Lincheng Li; Yu Ding; Xin Yu (2022). One-shot talking face generation from single-speaker audio-visual correlation learning. 
[b62] Ting-Chun Wang; Arun Mallya; Ming-Yu Liu (2021). One-shot free-view neural talking-head synthesis for video conferencing. 
[b63] Huawei Wei; Zejun Yang; Zhisheng Wang (2024). Aniportrait: Audio-driven synthesis of photorealistic portrait animation. 
[b64] Yue Wu; Yu Deng; Jiaolong Yang; Fangyun Wei; Qifeng Chen; Xin Tong (2022). Anifacegan: Animatable 3d-aware face image generation for video avatars. Advances in Neural Information Processing Systems
[b65] Yue Wu; Sicheng Xu; Jianfeng Xiang; Fangyun Wei; Qifeng Chen; Jiaolong Yang; Xin Tong (2023). Aniportraitgan: Animatable 3d portrait generation from 2d image collections. 
[b66] Jinbo Xing; Menghan Xia; Yuechen Zhang; Xiaodong Cun; Jue Wang; Tien-Tsin Wong (2023). Codetalker: Speech-driven 3d facial animation with discrete motion prior. 
[b67] Wilson Yan; Yunzhi Zhang; Pieter Abbeel; Aravind Srinivas (2021). Videogpt: Video generation using vq-vae and transformers. 
[b68] Fei Yin; Yong Zhang; Xiaodong Cun; Mingdeng Cao; Yanbo Fan; Xuan Wang; Qingyan Bai; Baoyuan Wu; Jue Wang; Yujiu Yang (2022). Styleheat: One-shot high-resolution editable talking face generation via pre-trained stylegan. 
[b69] Zhentao Yu; Zixin Yin; Deyu Zhou; Duomin Wang; Finn Wong; Baoyuan Wang (2023). Talking head generation with probabilistic audio-to-visual diffusion priors. 
[b70] Egor Zakharov; Aleksei Ivakhnenko; Aliaksandra Shysheya; Victor Lempitsky (2020). Fast bi-layer neural synthesis of one-shot realistic head avatars. 
[b71] Bowen Zhang; Chenyang Qi; Pan Zhang; Bo Zhang; Hsiangtao Wu; Dong Chen; Qifeng Chen; Yong Wang; Fang Wen (2023). Metaportrait: Identity-preserving talking head generation with fast personalized adaptation. 
[b72] Wenxuan Zhang; Xiaodong Cun; Xuan Wang; Yong Zhang; Xi Shen; Yu Guo; Ying Shan; Fei Wang (2023). Sadtalker: Learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. 
[b73] Zhimeng Zhang; Lincheng Li; Yu Ding; Changjie Fan (2021). Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. 
[b74] Hang Zhou; Yasheng Sun; Wayne Wu; Chen Change Loy; Xiaogang Wang; Ziwei Liu (2021). Pose-controllable talking face generation by implicitly modularized audio-visual representation. 
[b75] Yang Zhou; Xintong Han; Eli Shechtman; Jose Echevarria; Evangelos Kalogerakis; Dingzeyu Li (2020). MakeItTalk: speaker-aware talking-head animation. ACM Transactions On Graphics (TOG)

Figures:
Figure fig_0: 2
Type: figure
Caption: Figure 2: Our holistic facial dynamics and head pose generation framework with diffusion transformer.
Data: 

Figure fig_1: 
Type: figure
Caption: Figure A.2 further illustrates the effective disentanglement between head pose and facial dynamics. By holding one aspect constant and changing the other, the resulting images faithfully reflect the intended head and facial motions without interference. Out-of-distribution generation. Our method exhibits the capability to handle photo and audio inputs that fall outside the training distribution, such as artistic photos, singing audio clips, and non-English speech, as illustrated in Figure A.3. Comparison with other methods. Some visual examples from different methods are presented in Figures A.4, A.5, A.6, and A.7. Our method outperforms the others in terms of precise audio-lip synchronization and delivers much more vivid and natural facial dynamics and head movements.
Data: 

Figure fig_2: 3
Type: figure
Caption: Figure 3: Generated talking faces under different control signals. Top row: results under different main gaze direction conditions (forward-facing, leftwards, rightwards, and upwards, respectively). Middle row: results under different head distances (from far to near). Bottom row: results under different emotion offsets (neutral, happy, angry, and surprised, respectively).
Data: 

Figure fig_3: 4
Type: figure
Caption: Figure 4: Ablation study on the loss function $l_{consist}$ for disentangled latent space learning. We generate the results by only transferring the facial dynamics from source to target with head pose unchanged. $l_{consist}$ is crucial for decoupling subtle yet important facial dynamics from head pose.
Data: 

Figure fig_4: 1
Type: figure
Caption: Figure A.1: Disentanglement between identity and motion. In these examples, the same generated head and facial motion sequences are applied onto three different face images.
Data: 

Figure fig_5: 2
Type: figure
Caption: Figure A.2: Disentanglement between head pose and facial dynamics. From top to bottom: the raw generated sequence, applying generated poses with fixed initial facial dynamics, and applying generated facial dynamics with fixed initial head pose and pre-defined spinning poses, respectively.
Data: 

Figure fig_6: 3
Type: figure
Caption: Figure A.3: Generation results with out-of-distribution images (non-photorealistic) and audio (singing for the first two rows and non-English speech for the last row). Our method can still generate high-quality videos well aligned with the audio, although it was not trained on such data variations. See the supplementary video with audio for a better illustration of these results.
Data: 

Figure fig_7: 4
Type: figure
Caption: Figure A.4: Generation results from different methods with the input audio segment uttering "push ups". See our supplementary video for a better illustration and comparison.
Data: 

Figure fig_8: 5
Type: figure
Caption: Figure A.5: Generation results from different methods with the input audio segment uttering "excruciating". See our supplementary video for a better illustration and comparison.
Data: 

Figure fig_9: 6
Type: figure
Caption: Figure A.6: Generation results from different methods with the input audio segment uttering "what?". See our supplementary video for a better illustration and comparison.
Data: 

Figure fig_10: 7
Type: figure
Caption: Figure A.7: Generation results from different methods with the input audio segment uttering "lots of questions". See our supplementary video for a better illustration and comparison.
Data: 

Figure tab_0: 1
Type: table
Caption: Quantitative comparison with previous methods on two benchmarks.
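Data:
Method | VoxCeleb2 S_C ↑ | VoxCeleb2 S_D ↓ | VoxCeleb2 CAPP ↑ | VoxCeleb2 ∆P | OneMin-32 S_C ↑ | OneMin-32 S_D ↓ | OneMin-32 CAPP ↑ | OneMin-32 ∆P | FVD_25 ↓
MakeItTalk | 4.176 | 15.513 | -0.051 | 0.210 | -0.123 | 14.340 | 0.002 | 0.190 | 304.83
Audio2Head | 6.172 | 8.470 | 0.246 | 0.260 | 5.992 | 8.211 | 0.205 | 0.239 | 209.77
SadTalker | 5.843 | 8.813 | 0.441 | 0.275 | 5.501 | 8.850 | 0.383 | 0.252 | 214.51
Ours | 8.841 | 6.312 | 0.468 | 0.304 | 7.957 | 6.635 | 0.465 | 0.316 | 105.88
Ours (10% data) | 8.818 | 6.298 | 0.457 | 0.229 | 7.990 | 6.645 | 0.441 | 0.229 | 147.401
Real video | 7.640 | 7.189 | 0.588 | 0.505 | 7.192 | 7.254 | 0.559 | 0.405 | 29.25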

Figure tab_1: 2
Type: table
Caption: CAPP under frame shifting
Data:
Frame shift | 0 | ±1 | ±2 | ±3 | ±4
CAPP | 0.608 | 0.462 | 0.206 | 0.069 | 0.082

Figure tab_2: 3
Type: table
Caption: CAPP under pose variation scaling
Data:
Pose variation scale | ×0.2 | ×0.5 | ×1.0 | ×1.5 | ×3.0
CAPP | 0.368 | 0.584 | 0.608 | 0.587 | 0.505



Formulas:
Formula formula_0: $I_i$ as $\hat{I}_{i, z^{dyn}_j} = D(V^{app}_i, z^{id}_i, z^{pose}_i, z^{dyn}_j)$

Formula formula_1: $I_d$ onto $I_s$ and obtain $\hat{I}_{s, z^{pose}_d, z^{dyn}_d} = D(V^{app}_s, z^{id}_s, z^{pose}_d, z^{dyn}_d)$.

Formula formula_2: $X = \{[z^{pose}_i, z^{dyn}_i]\},\ i = 1, \ldots, W$.

Formula formula_3: $\mathbb{E}_{t \sim U[1,T],\ X^0, C \sim q(X^0, C)} \left( \| X^0 - H(X^t, t, C) \|^2 \right)$, (1)
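
The objective in Eq. (1) can be illustrated with a single-sample Monte Carlo estimate. The NumPy sketch below is only an illustration under stated assumptions: the forward noising step uses a standard DDPM-style linear beta schedule, which this section does not specify, and the names `make_alpha_bar`, `x0_prediction_loss`, and `denoiser` are hypothetical stand-ins (here `denoiser` plays the role of $H$ and is any callable taking the noisy sequence, the timestep, and the condition set).

```python
import numpy as np


def make_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention terms for an ASSUMED linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)


def x0_prediction_loss(denoiser, x0, cond, alpha_bar, rng):
    """One-sample estimate of || X^0 - H(X^t, t, C) ||^2 from Eq. (1)."""
    T = len(alpha_bar)
    t = int(rng.integers(1, T + 1))           # t ~ U[1, T], upper bound inclusive
    a = alpha_bar[t - 1]
    noise = rng.standard_normal(x0.shape)
    # Assumed DDPM forward process: X^t = sqrt(a) X^0 + sqrt(1 - a) * noise
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise
    pred = denoiser(x_t, t, cond)             # network predicts the clean X^0
    return float(np.sum((x0 - pred) ** 2))    # squared L2 norm of the residual


# Toy usage: a window of W=4 frames, each a 3-dim [z_pose, z_dyn] vector.
rng = np.random.default_rng(0)
x0 = np.ones((4, 3))
alpha_bar = make_alpha_bar(T=50)
loss = x0_prediction_loss(lambda x_t, t, c: np.zeros_like(x_t), x0, None, alpha_bar, rng)
```

Note that the network here regresses the clean sequence $X^0$ directly rather than the added noise, which is what the residual inside the norm of Eq. (1) expresses.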

Formula formula_4: $\hat{X}^0 = \left(1 + \sum_{c \in C} \lambda_c\right) \cdot H(X^t, t, C) - \sum_{c \in C} \lambda_c \cdot H(X^t, t, C|_{c=\emptyset})$ (2)
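
The guidance rule in Eq. (2) combines one fully conditioned prediction with one prediction per dropped condition. The sketch below is a minimal illustration, not the authors' implementation: `guided_prediction` and `toy_denoiser` are hypothetical names, conditions are passed as a dictionary, and `None` is assumed to stand in for the null condition $\emptyset$.

```python
import numpy as np


def guided_prediction(denoiser, x_t, t, conditions, scales):
    """Eq. (2): (1 + sum_c lambda_c) * H(X^t, t, C) - sum_c lambda_c * H(X^t, t, C|c=null)."""
    full = denoiser(x_t, t, conditions)
    out = (1.0 + sum(scales.values())) * full
    for name, lam in scales.items():
        dropped = dict(conditions)
        dropped[name] = None                  # replace only condition c with the null condition
        out = out - lam * denoiser(x_t, t, dropped)
    return out


def toy_denoiser(x_t, t, conditions):
    """Hypothetical stand-in for H: a linear map plus the sum of active conditions."""
    active = sum(v for v in conditions.values() if v is not None)
    return 0.5 * x_t + active


# Toy usage with two scalar conditions and per-condition guidance scales.
x_t = np.ones((4, 3))
conditions = {"audio": 1.0, "gaze": 2.0}
scales = {"audio": 0.5, "gaze": 0.0}
x0_hat = guided_prediction(toy_denoiser, x_t, t=10, conditions=conditions, scales=scales)
```

With every $\lambda_c$ set to zero the rule reduces to the plain conditional prediction $H(X^t, t, C)$, which is a convenient sanity check for any implementation.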

Formula formula_5: {X = {[z pose i , z dyn i
