WTPose: Waterfall Transformer for Multi-person Pose Estimation

Navin Ranjan, Bruno Artacho, Andreas E. Savakis

Published: 2025, Last Modified: 07 Nov 2025, Venue: WACV (Workshops) 2025, License: CC BY-SA 4.0

Abstract: Human pose estimation is an important problem with broad applications that can be particularly useful for privacy preservation when analyzing activities and human-object interactions. We propose the Waterfall Transformer architecture for Pose estimation (WTPose), a single-pass, end-to-end trainable framework designed for multi-person pose estimation. Our framework leverages a transformer-based waterfall module that generates multi-scale feature maps from various backbone stages. The module performs filtering in a cascade architecture to expand the receptive fields and to capture both local and global context, thereby increasing the overall feature representation capability of the network. Our experiments on the COCO dataset demonstrate that the proposed WTPose architecture, with a modified Swin backbone and transformer-based waterfall module, outperforms other transformer architectures for multi-person pose estimation.
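
To illustrate why filtering in a cascade expands the receptive field, the following is a minimal sketch and not the WTPose module itself. It assumes plain 1D dilated filters rather than the transformer-based blocks the abstract describes, and the kernel size, dilation rates, and function names are hypothetical values chosen only for illustration. It compares the receptive field reached after each stage of a cascade, where every stage filters the previous stage's output, against parallel branches that each filter the input directly.

```python
from typing import List


def cascade_receptive_fields(kernel_size: int, dilations: List[int]) -> List[int]:
    """Receptive field (in input positions) after each stage of a cascade.

    Each stage filters the output of the previous stage, so the receptive
    field accumulates: rf_i = rf_{i-1} + (kernel_size - 1) * dilation_i.
    """
    fields = []
    rf = 1  # a single input position before any filtering
    for d in dilations:
        rf += (kernel_size - 1) * d
        fields.append(rf)
    return fields


def parallel_receptive_fields(kernel_size: int, dilations: List[int]) -> List[int]:
    """Receptive field of each branch when every branch filters the input directly."""
    return [1 + (kernel_size - 1) * d for d in dilations]


# Hypothetical settings, not taken from the paper.
KERNEL_SIZE = 3
DILATIONS = [1, 2, 4]

cascade = cascade_receptive_fields(KERNEL_SIZE, DILATIONS)
parallel = parallel_receptive_fields(KERNEL_SIZE, DILATIONS)

print("cascade :", cascade)   # [3, 7, 15]
print("parallel:", parallel)  # [3, 5, 9]
```

Under these assumed settings, the last cascade stage covers 15 input positions versus 9 for the widest parallel branch, while the earlier stages keep smaller receptive fields. That mix of small and large fields is one way to read the abstract's claim that the cascade captures both local and global context.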

External IDs: dblp:conf/wacv/RanjanAS25