Abstract: Predicting human motion requires modeling long-range dependencies and accumulating errors when forecasting future poses from observed sequences. The transformer's self-attention is well suited to this task, but its quadratic complexity poses computational challenges. We present DeformMLP, an efficient network that replaces self-attention with fully connected layers. DeformMLP comprises three layer types for spatial-temporal modeling and calibration: DeformFCs captures spatial semantics, DeformFCt learns temporal relationships by summarizing time tokens, and DeformFCst assigns significance weights to feature dimensions to reduce computation. Through this decomposition and weight allocation, our method balances efficiency and accuracy. Evaluation on the Human3.6M, 3DPW, and CMU-MoCap datasets shows state-of-the-art prediction performance across benchmarks. The code is publicly available at https://github.com/HHT-98/DeformMLP.
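To make the attention-free design concrete, the following is a minimal sketch of how fully connected layers can mix a pose sequence along the spatial and temporal axes separately. All names and shapes here are illustrative assumptions, not the authors' DeformMLP implementation:

```python
import numpy as np

# Illustrative sketch (not the authors' code): attention-free spatio-temporal
# mixing of a pose sequence using plain fully connected (matrix) layers.

rng = np.random.default_rng(0)
T, J, C = 10, 22, 3                    # time steps, joints, coordinates per joint
x = rng.standard_normal((T, J * C))    # each frame flattened into one feature vector

# "Spatial" FC: mixes joint/coordinate features within each frame.
W_s = rng.standard_normal((J * C, J * C)) * 0.01
h = x @ W_s                            # shape (T, J*C)

# "Temporal" FC: mixes information across time steps, per feature channel.
W_t = rng.standard_normal((T, T)) * 0.01
y = W_t @ h                            # shape (T, J*C)

print(y.shape)  # (10, 66)
```

Stacking such spatial and temporal FC blocks, as opposed to pairwise self-attention, keeps the cost linear in the feature size per mixing step, which is the efficiency trade-off the abstract refers to.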