Abstract: Motion transfer transfers the pose from a driving video to the object in a source image, so that the object in the source image moves accordingly. Although great progress has been made recently in unsupervised motion transfer, many unsupervised methods still struggle to accurately model large-displacement motion when there is a large motion difference between the source and driving images. To address this problem, we propose an unsupervised, anytime-interpolation-based large-displacement motion transfer method that generates a series of interpolated images at arbitrary times between the source and driving images. By decomposing a large-displacement motion into many small-displacement motions, the difficulty of estimating large-displacement motion is reduced. In this process, we design a selector that chooses the optimal interpolated image from the generated candidates for downstream tasks. Since no real images are available as labels during interpolation, we propose a bidirectional training strategy and impose constraints on the optimal interpolated image so that a reasonable interpolated image is generated. To encourage the network to generate high-quality images, a pre-trained Vision Transformer is used to design the constraint losses. Finally, experiments show that, compared with the large-displacement motion between the source and driving images, the small-displacement motion between the interpolated and driving images makes motion transfer easier to realize. Compared with existing state-of-the-art methods, our method achieves significant improvements on motion-related metrics.
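As a rough, hypothetical sketch of the decomposition idea only (the function names and the distance-based selector below are illustrative assumptions, not the paper's learned components), anytime interpolation of keypoints between the source and driving poses could look like:

```python
import torch

def anytime_interpolate(kp_source: torch.Tensor, kp_driving: torch.Tensor, num_steps: int = 4):
    """Linearly interpolate K keypoints of shape (K, 2) at times t in (0, 1),
    splitting one large displacement into several small ones."""
    candidates = []
    for i in range(1, num_steps + 1):
        t = i / (num_steps + 1)
        candidates.append(((1.0 - t) * kp_source + t * kp_driving, t))
    return candidates

def pick_intermediate(candidates, kp_source, kp_driving):
    """Toy stand-in for the paper's learned selector: prefer a candidate whose
    residual displacements to the source and driving poses are balanced, so
    neither hop (source -> interpolated, interpolated -> driving) stays large."""
    return min(
        candidates,
        key=lambda c: abs(torch.norm(c[0] - kp_source) - torch.norm(c[0] - kp_driving)),
    )
```

With linear interpolation and evenly spaced times, this heuristic simply favors t ≈ 0.5; the paper's selector instead learns which interpolated image best serves the downstream motion-transfer network.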
Lay Summary: We perform unsupervised interpolation between the source and driving images to explore whether interpolated images can address the problem that a large pose difference between the source and driving images is difficult to model.
We interpolate using keypoint information to generate multiple interpolated images and then select the optimal one. The optimal interpolated image decomposes the complex large-displacement motion between the source and driving images into two simpler small-displacement motions. To enhance realism, we add constraints derived from a pre-trained Vision Transformer (ViT) to guide texture and motion coherence.
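A minimal sketch of a ViT-feature constraint in this spirit is given below, assuming `vit_features` wraps a frozen pre-trained ViT encoder that returns patch-token features; this is an illustrative loss, not necessarily the exact one used in the paper:

```python
import torch
import torch.nn.functional as F

def vit_constraint_loss(vit_features, generated, reference):
    """Match patch-token features of a frozen pre-trained ViT between the
    generated interpolated image and a reference image.

    vit_features: callable mapping an image batch (B, 3, H, W) to token
                  features (B, N, C); assumed to be a frozen ViT encoder.
    """
    with torch.no_grad():
        feat_ref = vit_features(reference)   # reference path carries no gradient
    feat_gen = vit_features(generated)       # gradients flow back to the generator
    return F.l1_loss(feat_gen, feat_ref)
```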
By reducing large displacements to smaller steps, our method achieves strong results on motion metrics, in some cases surpassing existing methods. Our study has implications for how to learn the pose of the driving image while preserving the appearance of the source image.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications->Computer Vision
Keywords: Image animation, action generation, unsupervised interpolation
Flagged For Ethics Review: true
Submission Number: 15058