Abstract: Human motion prediction is crucial for applications ranging from robotics to human-computer interaction. This paper introduces a multi-scale spatio-temporal cross-attention algorithm for human motion prediction that models long-term dependencies in motion sequences. The method uses a dual-stream spatio-temporal Transformer framework that decouples temporal and spatial features, so that one stream captures dynamic temporal dependencies and the other captures spatial correlations. A key innovation is a cross-attention mechanism that exchanges information between the temporal and spatial streams, keeping the two representations consistent. In addition, a multi-scale cross-attention mechanism captures relationships across different scales. Extensive experiments on benchmark datasets such as Human3.6M, CMU-MoCap, AMASS, and 3DPW show that the proposed model outperforms state-of-the-art methods in both short-term and long-term prediction accuracy. The MPJPE, MAE, PSEnt, and PSKLD metrics support the model's ability to generate accurate and smooth motion trajectories. Ablation studies confirm the contribution of each component and indicate the algorithm's robustness and efficiency. This research advances human motion prediction by offering a precise and reliable approach to understanding and forecasting complex motion patterns.
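The exchange between the temporal and spatial streams can be illustrated with a minimal sketch of single-head scaled dot-product cross-attention, in which each stream queries the other. This is not the paper's implementation: the token shapes, the feature width, the random projection matrices, and the names `cross_attention` and `mpjpe` are all assumptions made for illustration, and the multi-scale component is omitted. The sketch also includes MPJPE (mean per-joint position error), computed as the mean Euclidean distance between predicted and ground-truth joint positions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries: (n_q, d) tokens from one stream.
    context: (n_c, d) tokens from the other stream.
    Returns the attended features (n_q, d) and the weights (n_q, n_c).
    """
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


def mpjpe(pred, target):
    """Mean per-joint position error for arrays shaped (..., joints, 3)."""
    return float(np.linalg.norm(pred - target, axis=-1).mean())


rng = np.random.default_rng(0)
T, J, d = 10, 22, 16  # frames, joints, feature width: hypothetical values

temporal_tokens = rng.standard_normal((T, d))  # one token per frame
spatial_tokens = rng.standard_normal((J, d))   # one token per joint


def random_projections():
    # Hypothetical, untrained projection matrices (query, key, value).
    return tuple(rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))


# Temporal stream attends to spatial tokens, and vice versa.
temporal_out, t2s = cross_attention(temporal_tokens, spatial_tokens,
                                    *random_projections())
spatial_out, s2t = cross_attention(spatial_tokens, temporal_tokens,
                                   *random_projections())

# MPJPE on a toy example: every joint is displaced by the vector (3, 4, 0).
target = np.zeros((T, J, 3))
pred = target + np.array([3.0, 4.0, 0.0])
error = mpjpe(pred, target)
```

Each row of `t2s` and `s2t` sums to one, so every output token is a convex combination of the other stream's value vectors. In a trained model the projections would be learned, and the attention would typically be multi-head with residual connections.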
External IDs: dblp:conf/ijcnn/GaoHX25