FlatPose: An Upsampling-Free Transformer for Human Pose Estimation

ICLR 2026 Conference Submission 15597 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: human pose estimation, keypoint detection, computer vision
Abstract: Human Pose Estimation (HPE) methods often face a trade-off between accuracy and computational cost, which is largely driven by the reliance on upsampling layers to generate high-resolution feature maps. This paper examines the necessity of this convention, investigating whether spatial upsampling is indispensable for precise pose estimation. Our preliminary experiments reveal that coordinate classification-based methods exhibit notable robustness to feature map resolution, unlike their heatmap-based counterparts. This insight suggests a promising, yet challenging, path toward developing entirely upsampling-free architectures. To address the core challenge of recovering fine-grained geometric relationships from spatially coarse features, we introduce FlatPose, a novel and efficient framework that operates directly on low-resolution feature maps from the network backbone. At the heart of FlatPose is a two-stage hierarchical feature enhancement strategy. First, in the Global Encoding stage, we propose the Implicit Coordinate Attention mechanism, which empowers the model to learn a dynamic, content-aware "semantic coordinate system" to model complex, non-local geometric structures from spatially coarse features. Second, in the Targeted Refinement stage, a Salience-Guided selection mechanism identifies the most critical feature regions, which are then deeply optimized via a targeted cross-attention module that focuses computation where it is most needed. Extensive experiments on the challenging COCO, MPII, and CrowdPose benchmarks show that FlatPose achieves a compelling balance between accuracy and computational efficiency. Our work validates that high-precision pose estimation is achievable without explicit upsampling, offering a new and effective paradigm for the field. Our code will be made open source.
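To make the abstract's pipeline concrete, the following is a minimal, illustrative sketch (not the authors' code) of an upsampling-free, coordinate-classification head operating on a coarse backbone feature map: a global encoding stage (here a plain transformer encoder layer stands in for the proposed Implicit Coordinate Attention), a salience-guided refinement stage that cross-attends only over the top-k most salient tokens, and a SimCC-style head that classifies x/y bins per keypoint. All module names, shapes, bin counts, and the top-k rule are assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn


class SalienceGuidedRefinement(nn.Module):
    """Score tokens, select the top-k, and refine them with cross-attention
    against the full token set (an assumed form of 'targeted refinement')."""

    def __init__(self, dim: int, num_heads: int = 8, top_k: int = 32):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-token salience score
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.top_k = top_k

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C) flattened low-resolution feature map
        scores = self.score(tokens).squeeze(-1)            # (B, N)
        k = min(self.top_k, tokens.shape[1])
        idx = scores.topk(k, dim=1).indices                # (B, k) most salient tokens
        salient = torch.gather(
            tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        )                                                  # (B, k, C)
        refined, _ = self.attn(salient, tokens, tokens)    # cross-attend to all tokens
        out = tokens.clone()
        out.scatter_(1, idx.unsqueeze(-1).expand_as(refined), salient + refined)
        return out


class CoordClassificationHead(nn.Module):
    """Predict per-keypoint 1-D logits over x and y bins directly from coarse
    tokens (no upsampling), in the spirit of coordinate-classification methods."""

    def __init__(self, dim: int, num_joints: int = 17, bins: int = 256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_joints, dim))
        self.attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.to_x = nn.Linear(dim, bins)
        self.to_y = nn.Linear(dim, bins)

    def forward(self, tokens: torch.Tensor):
        q = self.queries.unsqueeze(0).expand(tokens.shape[0], -1, -1)
        joint_feats, _ = self.attn(q, tokens, tokens)      # (B, J, C)
        return self.to_x(joint_feats), self.to_y(joint_feats)


if __name__ == "__main__":
    B, C, H, W = 2, 256, 8, 6                                          # coarse backbone output
    feats = torch.randn(B, C, H, W).flatten(2).transpose(1, 2)         # (B, N, C)
    feats = nn.TransformerEncoderLayer(C, 8, batch_first=True)(feats)  # global encoding (placeholder)
    feats = SalienceGuidedRefinement(C)(feats)                         # targeted refinement
    x_logits, y_logits = CoordClassificationHead(C)(feats)
    print(x_logits.shape, y_logits.shape)                              # (2, 17, 256) each
```

The point of the sketch is the dataflow: every stage consumes and produces tokens at the backbone's native resolution, so keypoint coordinates are recovered by classifying 1-D bins rather than by upsampling to a high-resolution heatmap.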
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15597