Keywords: deep learning, image animation, generative modeling
Abstract: We propose novel motion representations for animating articulated objects consisting of distinct parts. In a completely unsupervised manner, our method identifies meaningful object parts, tracks them in a driving video, and infers their motions by considering their principal axes. In contrast to previous keypoint-based works, our method extracts meaningful and consistent regions describing location, shape, and pose. The regions correspond to semantically relevant and distinct object parts that are more easily detected in the frames of the driving video. To force decoupling of foreground from background, we model non-object-related global motion with a homography. Our model can animate a variety of objects, surpassing previous methods by a large margin on existing benchmarks. We present a challenging new benchmark with high-resolution videos and show that the improvement is particularly pronounced when articulated objects are considered.
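To make the principal-axes idea concrete, below is a minimal sketch of how a region's location and principal axes can be recovered from a soft heatmap via its first and second moments. It assumes the region predictor outputs per-part heatmaps; the function name, tensor shapes, and PyTorch usage are illustrative assumptions, not the authors' released implementation.

```python
import torch

def region_to_affine(heatmap, eps=1e-6):
    """Estimate each region's center and principal axes from soft heatmaps.

    heatmap: (B, K, H, W) non-negative per-part assignments (e.g. softmax output).
    Returns the per-region mean (B, K, 2) and a 2x2 basis (B, K, 2, 2) whose
    columns are the principal axes scaled by the spread along each axis.
    """
    B, K, H, W = heatmap.shape
    # Normalize each region map into a probability distribution over pixels.
    prob = heatmap / (heatmap.sum(dim=(2, 3), keepdim=True) + eps)

    # Pixel coordinate grid in [-1, 1], shape (H, W, 2).
    ys = torch.linspace(-1, 1, H)
    xs = torch.linspace(-1, 1, W)
    grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)

    # First moment: region center, shape (B, K, 2).
    mean = torch.einsum("bkhw,hwc->bkc", prob, grid)

    # Second moment: 2x2 covariance of each region's mass, shape (B, K, 2, 2).
    diff = grid[None, None] - mean[:, :, None, None, :]
    cov = torch.einsum("bkhw,bkhwc,bkhwd->bkcd", prob, diff, diff)

    # Principal axes via eigendecomposition (PCA of the region's mass).
    evals, evecs = torch.linalg.eigh(cov)
    basis = evecs * evals.clamp_min(eps).sqrt()[..., None, :]
    return mean, basis
```

Under this sketch, a part's motion between a source and a driving frame could then be composed from the two per-frame estimates, e.g. translating by the difference of the means and rotating/scaling by the driving basis times the inverse of the source basis.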
One-sentence Summary: We propose an explicit region representation instead of regressed transformations.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/arxiv:2104.11280/code)
Reviewed Version (pdf): https://openreview.net/references/pdf?id=CQDukNxX2