Multi-view Ancestral Sampling (MAS) uses a 2D motion diffusion model to generate novel, high-quality 3D motions. This technique enables learning intricate motions from monocular data alone.
We introduce Multi-view Ancestral Sampling (MAS), a method for generating consistent multi-view 2D samples of a motion sequence, enabling the creation of a coherent 3D counterpart. While abundant 2D samples are readily available, such as those found in videos, 3D data collection is both involved and expensive, often requiring specialized motion-capture systems. MAS leverages diffusion models trained solely on 2D data to produce coherent and realistic 3D motions. This is achieved by running multiple ancestral sampling processes in parallel, denoising several 2D sequences that depict the same motion from different viewing angles. Our consistency block ensures 3D consistency at each diffusion step by combining the individual generations into a unified 3D sequence, which is then projected back to the original views. We evaluate MAS on 2D pose data from intricate and unique motions, including professional basketball maneuvers, rhythmic gymnastics performances featuring ball apparatus routines, and horse obstacle course races. In each of these domains, MAS generates diverse, high-quality, and unprecedented 3D sequences that would otherwise require expensive equipment and intensive human labor to obtain.
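The consistency block described above can be pictured as a triangulate-and-reproject step: the per-view 2D poses are fused into a single 3D pose, which is then projected back so every view agrees. The sketch below is a minimal illustration of that idea under simplifying assumptions (known camera projection matrices, noise-free 2D poses, direct least-squares triangulation); the function names and NumPy setup are ours, not the paper's implementation:

```python
import numpy as np

def triangulate(points_2d, proj_mats):
    """Least-squares (DLT) triangulation of one point seen in V views.

    points_2d: (V, 2) image coordinates; proj_mats: (V, 3, 4) cameras.
    Returns the 3D point that best satisfies all projections.
    """
    rows = []
    for (x, y), P in zip(points_2d, proj_mats):
        # x = (P[0]·X) / (P[2]·X)  =>  x*P[2]·X - P[0]·X = 0 (and likewise for y)
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    _, _, vh = np.linalg.svd(np.asarray(rows))
    X = vh[-1]                     # null-space vector = homogeneous solution
    return X[:3] / X[3]            # homogeneous -> Euclidean

def reproject(X, P):
    """Project a 3D point X with a 3x4 camera matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def consistency_step(poses_2d, proj_mats):
    """Toy consistency block: fuse V per-view 2D poses (V, J, 2) into one
    3D pose (J, 3), then project it back to every view (V, J, 2)."""
    V, J, _ = poses_2d.shape
    pose_3d = np.stack([triangulate(poses_2d[:, j], proj_mats)
                        for j in range(J)])
    fused_2d = np.stack([[reproject(X, P) for X in pose_3d]
                         for P in proj_mats])
    return pose_3d, fused_2d
```

In MAS this fusion happens at every diffusion step, so the multiple 2D denoising trajectories stay consistent with a single underlying 3D motion rather than drifting apart.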