Abstract: Recently, 3D human motion reconstruction has attracted considerable attention and found applications in fields such as virtual reality and sports video analysis. Several video-based 3D human pose and shape estimation methods have been proposed; however, these methods still fail to accurately align the generated 3D model with the human body regions in the images. To tackle this problem, we propose a new model named Transformer Encoder with Sliding Attention window (TESA). Our method uses the VIBE architecture as the backbone and introduces a multi-head attention encoder to generate more accurate human pose and shape. The multi-head attention structure of the transformer can extract local and temporal features across frames; e.g., one attention head might focus on a hand region, while another might extract features of a leg. Furthermore, a new module named the sliding attention window is proposed to obtain smooth human motion by applying a motion constraint between the current frame and its neighboring frames. Our method outperforms previous video-based methods in accuracy, achieving state-of-the-art performance.
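The sliding attention window can be understood as a banded mask over frame-to-frame attention scores, so each frame attends only to its temporal neighbors. The sketch below is our own single-head, NumPy illustration of that general idea; the function names, the window parameter, and the single-head simplification are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def sliding_window_mask(num_frames, window):
    """Boolean mask: frame t may attend only to frames within +/- window."""
    idx = np.arange(num_frames)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def masked_attention(q, k, v, window):
    """Single-head scaled dot-product attention restricted to a sliding window.

    q, k, v: (num_frames, dim) per-frame feature vectors (illustrative).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)             # (T, T) frame-to-frame scores
    mask = sliding_window_mask(len(q), window)
    scores = np.where(mask, scores, -np.inf)  # block out-of-window frames
    # Softmax over the allowed frames only (diagonal is always allowed,
    # so every row has at least one finite score).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                        # (T, dim) temporally smoothed features
```

Restricting attention to a neighborhood acts as the motion constraint described above: each frame's output is a weighted average of nearby frames only, which suppresses jitter between distant, unrelated poses.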