WTPose: Waterfall Transformer for Multi-person Pose Estimation

Published: 01 Jan 2025, Last Modified: 07 Nov 2025, WACV (Workshops) 2025, CC BY-SA 4.0
Abstract: Human pose estimation is an important problem with broad applications, and it can be particularly useful for privacy preservation when analyzing activities and human-object interactions. We propose the Waterfall Transformer architecture for Pose estimation (WTPose), a single-pass, end-to-end trainable framework designed for multi-person pose estimation. Our framework leverages a transformer-based waterfall module that generates multi-scale feature maps from various backbone stages. The module performs filtering in a cascade architecture to expand the receptive fields and to capture both local and global context, thereby increasing the overall feature representation capability of the network. Our experiments on the COCO dataset demonstrate that the proposed WTPose architecture, with a modified Swin backbone and transformer-based waterfall module, outperforms other transformer architectures for multi-person pose estimation.
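To illustrate why filtering in a cascade expands the receptive field, the following is a minimal arithmetic sketch, not the WTPose implementation. It assumes stride-1 dilated filters with a 3-wide kernel and hypothetical dilation rates of 1, 2, 4, and 8; the abstract does not specify the filter type or rates, and the function names are illustrative. It compares the per-axis receptive field when each stage consumes the previous stage's output (cascade) with the case where each filter sees the input independently (parallel).

```python
def cascade_receptive_fields(kernel_size, dilations):
    """Per-axis receptive field after each stage of a cascade of
    stride-1 dilated filters, where each stage reads the previous output."""
    fields = []
    rf = 1  # a single input position before any filtering
    for d in dilations:
        # Each stride-1 dilated filter widens the field by (k - 1) * d.
        rf += (kernel_size - 1) * d
        fields.append(rf)
    return fields


def parallel_receptive_fields(kernel_size, dilations):
    """Per-axis receptive field of each branch when every dilated filter
    is applied directly to the input (no cascading)."""
    return [1 + (kernel_size - 1) * d for d in dilations]


# Hypothetical dilation rates, chosen only for illustration.
dilations = [1, 2, 4, 8]
cascade = cascade_receptive_fields(3, dilations)
parallel = parallel_receptive_fields(3, dilations)

print(cascade)   # [3, 7, 15, 31]
print(parallel)  # [3, 5, 9, 17]
```

Under these assumptions, the last cascade stage covers 31 positions per axis compared with 17 for the widest parallel branch, while the earlier stages still expose smaller fields. This is one way a cascade can provide both local and global context from the same set of filters.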