OSM-Net: One-to-Many One-shot Talking Head Generation with Spontaneous Head Motions
Abstract: One-shot talking head generation lacks an explicit head-movement reference, which makes it difficult to generate talking heads with head motions. Some existing works edit only the mouth area and produce still talking heads, leading to unrealistic results. Others construct a one-to-one mapping between the audio signal and head motion sequences, which introduces ambiguous correspondences into the mapping, since people move their heads differently when speaking the same content. One-shot talking head generation is therefore actually a one-to-many, ill-posed problem: people present diverse head motions when speaking the same words. The one-to-one formulation fails to model this diversity and produces either nearly static or exaggerated head motions, which look unnatural and strange. Based on this observation, we propose OSM-Net, a one-to-many one-shot talking head generation network with natural head motions. OSM-Net constructs a motion space containing rich and varied clip-level head motion features. Each basis of the space represents a meaningful head motion feature over a clip rather than a single frame, thus providing more coherent and natural motion changes in the generated talking heads. The driving audio is mapped into the motion space, and various motion features can then be sampled within a reasonable range around the mapped point to achieve the one-to-many mapping. In addition, a landmark constraint and time-window feature input improve accurate expression feature extraction and video generation. Extensive experiments show that OSM-Net generates more natural and realistic head motions under a reasonable one-to-many mapping paradigm than other methods.
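The core idea of mapping audio into a motion space and sampling nearby features to obtain diverse head motions admits a compact sketch. The following is a minimal, hypothetical PyTorch illustration of that one-to-many sampling scheme; the module names, dimensions, and the bounded Gaussian perturbation are assumptions made for illustration, not OSM-Net's actual architecture.

```python
import torch

# Illustrative sketch only: all names and design choices below are assumptions.
class OneToManyMotionSampler(torch.nn.Module):
    def __init__(self, audio_dim=256, motion_dim=128, radius=0.1):
        super().__init__()
        # Maps a clip-level audio feature to an anchor point in the motion space.
        self.audio_to_motion = torch.nn.Linear(audio_dim, motion_dim)
        # Sampling radius bounds how far samples may deviate from the anchor,
        # keeping the sampled motions within a "reasonable range".
        self.radius = radius

    def forward(self, audio_feat, num_samples=5):
        # Anchor point in the motion space for this audio clip.
        anchor = self.audio_to_motion(audio_feat)              # (B, motion_dim)
        # Draw several perturbations around the anchor, yielding distinct
        # but plausible head-motion features for the same audio input.
        noise = torch.randn(num_samples, *anchor.shape) * self.radius
        return anchor.unsqueeze(0) + noise                     # (S, B, motion_dim)

sampler = OneToManyMotionSampler()
audio_feat = torch.randn(2, 256)   # batch of 2 clip-level audio features
motions = sampler(audio_feat)      # 5 candidate motion features per clip
print(motions.shape)               # torch.Size([5, 2, 128])
```

Under this reading, the one-to-many property comes from treating the audio-to-motion mapping as a distribution over a neighborhood in the motion space rather than a single deterministic point.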