Future Motion Dynamic Modeling via Hybrid Supervision for Multi-Person Motion Prediction Uncertainty Reduction
Abstract: Multi-person motion prediction remains challenging because of intricate motion dynamics and complex interpersonal interactions, with uncertainty escalating rapidly across the forecasting horizon. Existing approaches typically do not explicitly model the motion dynamics among the predicted frames to reduce this uncertainty; instead, they leave it entirely to the deep neural network, which lacks a dynamic inductive bias, leading to suboptimal performance. This paper addresses this limitation by proposing an effective multi-person motion prediction method named Hybrid Supervision Transformer (HSFormer), which formulates dynamic modeling within the prediction horizon as a novel hybrid supervision task. Specifically, our method performs a rolling prediction process equipped with a hybrid supervision mechanism, which requires the model to predict the poses in subsequent frames from its earlier (typically error-containing) predictions. In addition to the standard supervision loss, we introduce two mechanisms: a self-supervision mechanism, which minimizes the distance between predictions made from error-containing inputs and predictions made from error-free (ground-truth) inputs to improve the model's robustness to input deviation at inference, and an auxiliary supervision mechanism, which guides the model to make accurate predictions from the ground truth to stabilize training. Optimization techniques such as stop-gradient are also extended to our model to improve training efficiency.
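The three loss terms named in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the name `hybrid_supervision_loss`, the one-frame `step` predictor, the mean-squared-error distance, the equal default weights, and the point at which stop-gradient would be applied. Poses are flattened to plain lists of floats so the example runs without any deep learning framework.

```python
from typing import Callable, Dict, List, Sequence

Pose = List[float]  # toy stand-in: flattened joint coordinates of one frame


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized pose vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def hybrid_supervision_loss(
    step: Callable[[Pose], Pose],   # hypothetical one-frame-ahead predictor
    last_observed: Pose,
    future_gt: Sequence[Pose],
    w_self: float = 1.0,            # assumed weight, not from the paper
    w_aux: float = 1.0,             # assumed weight, not from the paper
) -> Dict[str, float]:
    """Roll the predictor over the horizon and average three loss terms."""
    if not future_gt:
        raise ValueError("future_gt must contain at least one frame")

    rolled_input = last_observed    # will carry earlier (error-containing) predictions
    clean_input = last_observed     # will carry ground-truth frames
    supervised = self_sup = auxiliary = 0.0

    for target in future_gt:
        pred_rolled = step(rolled_input)  # prediction from the model's own output
        pred_clean = step(clean_input)    # prediction from error-free input

        # Standard supervision: rolled prediction vs. ground truth.
        supervised += mse(pred_rolled, target)
        # Self supervision: pull the rolled prediction toward the clean one.
        # In an autodiff framework, pred_clean would typically be wrapped in
        # stop-gradient here (an assumption about where it is applied).
        self_sup += mse(pred_rolled, pred_clean)
        # Auxiliary supervision: clean-input prediction vs. ground truth.
        auxiliary += mse(pred_clean, target)

        rolled_input = pred_rolled  # rolling: feed the prediction back in
        clean_input = target        # teacher-forced branch uses ground truth

    n = len(future_gt)
    supervised, self_sup, auxiliary = supervised / n, self_sup / n, auxiliary / n
    total = supervised + w_self * self_sup + w_aux * auxiliary
    return {
        "supervised": supervised,
        "self": self_sup,
        "auxiliary": auxiliary,
        "total": total,
    }
```

With a predictor that simply copies its input, `hybrid_supervision_loss(lambda p: list(p), [0.0], [[1.0], [2.0]])` shows the self-supervision term becoming non-zero only from the second frame onward, once the rolled input and the ground-truth input have diverged.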
Primary Subject Area: [Engagement] Emotional and Social Signals
Secondary Subject Area: [Engagement] Multimedia Search and Recommendation, [Experience] Interactions and Quality of Experience
Relevance To Conference: Our model can serve as a downstream component for many tasks, such as pose estimation. It involves multiple input modalities, including image streams and human skeletons, and aims to understand human behavior through better feature modeling.
Supplementary Material: zip
Submission Number: 451