Coarse-to-Fine Human Mesh Recovery with Transformers

Vatsal Agarwal, Mara Levy, Max Ehrlich, Youbao Tang, Ning Zhang, Abhinav Shrivastava

Published: 01 Jan 2024, Last Modified: 19 Jul 2025, ECCV Workshops (13) 2024, CC BY-SA 4.0

Abstract: The introduction of Transformer networks in computer vision has resulted in rapid progress of deep models in a variety of vision tasks. Recently, there has been great success in utilizing such networks for the human mesh recovery task. While these works demonstrate remarkable performance, they suffer from high computational cost and slow speed due to the quadratic nature of the self-attention mechanism. In this work, we propose a coarse-to-fine modeling approach to improve the pipeline efficiency. We build upon previous approaches and adopt an encoder-decoder architecture to mine relationships between image, joint and vertex features. While previous works apply attention on the full set of vertex features, our key insight is that earlier model layers do not require such dense vertex representations and instead can rely on a sparser set of features. We evaluate our approach on the Human3.6M and 3DPW datasets and find that with our coarse-to-fine approach, we are able to achieve improved or competitive performance with a 3.7x reduction in FLOPs and a 1.7x reduction in activation count compared to state-of-the-art approaches.
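To make the efficiency argument concrete, the following is a minimal sketch of how the cost of a single self-attention layer scales with the number of tokens. It is not the authors' implementation: the function names, the cost model (multiply-accumulates for the four linear projections plus the two attention matrix products) and the token counts and feature width are illustrative assumptions, not values taken from the paper.

```python
def quadratic_term(num_tokens: int, dim: int) -> int:
    """Multiply-accumulates for the score matrix (Q @ K^T) and the
    weighted sum (softmax(scores) @ V). Both scale as n^2 * d."""
    return 2 * num_tokens * num_tokens * dim


def attention_macs(num_tokens: int, dim: int) -> int:
    """Approximate multiply-accumulates for one single-head
    self-attention layer: Q, K, V and output projections
    (4 * n * d^2) plus the quadratic attention products."""
    projections = 4 * num_tokens * dim * dim
    return projections + quadratic_term(num_tokens, dim)


if __name__ == "__main__":
    dim = 64  # hypothetical feature width
    # Hypothetical token counts: a sparse (coarse) vertex set
    # versus a denser (fine) one.
    for num_tokens in (400, 1600):
        print(num_tokens, attention_macs(num_tokens, dim))
```

Under this cost model, a 4x increase in token count multiplies the attention products by 16, which is why running early layers on a sparser vertex set and deferring the dense set to later layers can reduce total compute.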