New approaches for predicting and generating human motions from 3D skeletons: application to non-verbal social interactions in virtual reality. (Nouvelles approches pour la prédiction et la génération de mouvement humain utilisant des squelettes 3D : application aux interactions non-verbales en réalité virtuelle)

Baptiste Chopin

Published: 01 Jan 2023, Last Modified: 05 Nov 2023, Readers: Everyone

Abstract: In this thesis, we address several tasks related to generating 3D skeletons of humans in motion. The ability to predict and generate human motion has become an important topic in recent years in many domains, including self-driving vehicles, animation, and virtual reality. Although deep learning has greatly improved the performance of generative models, the generation of human motion remains an open problem, and even recent methods struggle to generate high-quality human motion. This is due to the need to model both spatial and temporal components and to understand the interactions between human body parts. The task is also challenging because motions vary widely, both in time, since the same motion can be performed at different speeds, and in space, since the amplitude of a motion can vary greatly. Furthermore, the generated 3D motions must be accurate, realistic, and smooth. We first propose a new predictive Wasserstein generative adversarial network (GAN) to predict the end of a person's motion. Our predictive network uses the SRVF representation to model human motion, which allows it to predict accurate motion without discontinuities in real time, as shown in our experiments against state-of-the-art methods. We then turn to the generation of interaction motions between two people. We present a new method to generate a reaction motion in response to an action. Unlike state-of-the-art methods, which focus on generating the motion of a single person, we propose Interformer, a Transformer that predicts the reaction to an action using the temporal modeling abilities of the Transformer network together with new skeleton adjacency and interaction distance modules that model the interactions. We compare our results with interaction generation and motion prediction methods and outperform them. Next, we develop a new architecture that generates the motion of two interacting people from a class label. Our architecture leverages the capabilities of diffusion models, the Transformer architecture, and bipartite graph networks. Our results show that our method outperforms the state of the art both quantitatively and qualitatively. Finally, we propose an application that uses our motion prediction method to allow a virtual agent to predict and recognize a person's motion in non-verbal interactions in a virtual environment. For this purpose, we propose a new 3D motion database captured with a high-quality motion capture system and a depth camera.
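
For readers unfamiliar with the SRVF (square-root velocity function) representation mentioned in the abstract, it is commonly defined for a curve β as q(t) = β'(t) / sqrt(||β'(t)||). The following is a minimal sketch of that standard definition using finite differences on a single joint trajectory. It is not the thesis's implementation; the function names `srvf` and `srvf_inverse`, the `dt` and `eps` parameters, and the toy trajectory are assumptions made for illustration.

```python
import math

def srvf(curve, dt=1.0, eps=1e-8):
    """Discrete SRVF of a curve given as a list of D-dimensional points.

    Returns one q-vector per pair of consecutive points.
    `eps` (an assumption of this sketch) avoids division by zero
    when the curve is stationary.
    """
    q = []
    for prev, curr in zip(curve, curve[1:]):
        velocity = [(c - p) / dt for p, c in zip(prev, curr)]
        speed = math.sqrt(sum(v * v for v in velocity))
        scale = math.sqrt(max(speed, eps))
        q.append([v / scale for v in velocity])
    return q

def srvf_inverse(q, start, dt=1.0):
    """Rebuild the curve from its SRVF and a starting point.

    Uses the identity beta'(t) = q(t) * ||q(t)|| and accumulates
    the recovered velocities step by step.
    """
    curve = [list(start)]
    for sample in q:
        norm = math.sqrt(sum(x * x for x in sample))
        curve.append([c + x * norm * dt for c, x in zip(curve[-1], sample)])
    return curve

# Toy 3D trajectory standing in for one skeleton joint (hypothetical data).
trajectory = [[math.sin(0.1 * i), math.cos(0.1 * i), 0.1 * i] for i in range(20)]
q = srvf(trajectory)
reconstructed = srvf_inverse(q, trajectory[0])
```

The SRVF discards the starting position, so the inverse needs that point to recover the original curve.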
