Keywords: Human sensing, Deep learning, wireless signal
TL;DR: We propose UniversalPose, a unified human pose estimation framework that processes diverse sensing modalities.
Abstract: We propose UniversalPose, a unified pose estimation framework that supports a wide range of sensing modalities, including WiFi, mmWave, acoustic, LiDAR, and depth. Recent methods have explored such alternative modalities to improve robustness in situations where conventional RGB-based approaches often fail (e.g., in low-light or occluded environments) or raise privacy concerns. However, they typically rely on modality-specific architectures, which limits their scalability and generalization to new sensor types.
UniversalPose addresses these limitations by transforming all inputs into a shared representation of token sequences, enabling a single architecture to handle heterogeneous data formats. To ensure efficient and stable learning, we introduce pseudo-3D positional embeddings and apply multi-modal locality-aware self-attention, even for modalities without explicit spatial coordinates.
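To make the shared-representation idea concrete, here is a minimal NumPy sketch of how heterogeneous sensor data might be mapped to a common token sequence with pseudo-3D positional embeddings. All shapes, the projection matrices, and the assignment of pseudo-coordinates (e.g., subcarrier/antenna indices for WiFi) are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

D = 32  # shared token embedding dimension (hypothetical)

def sinusoidal(pos, dim):
    """Standard sinusoidal embedding for one scalar coordinate axis."""
    i = np.arange(dim // 2)
    freq = 1.0 / (10000 ** (2 * i / dim))
    angles = pos[:, None] * freq[None, :]          # (N, dim//2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def tokenize(raw, proj, coords):
    """Project raw per-element features to D-dim tokens and add a
    pseudo-3D positional embedding (sum of per-axis embeddings)."""
    tokens = raw @ proj                            # (N, D)
    pe = sum(sinusoidal(coords[:, a], D) for a in range(3))
    return tokens + pe

rng = np.random.default_rng(0)

# WiFi CSI (hypothetical layout): 30 subcarriers x 3 antennas,
# 2 features per element; no true spatial depth, so z is zero.
wifi_raw = rng.normal(size=(90, 2))
wifi_coords = np.stack([np.arange(90) % 30,        # subcarrier index
                        np.arange(90) // 30,       # antenna index
                        np.zeros(90)], axis=-1)    # pseudo z-axis
wifi_tokens = tokenize(wifi_raw, rng.normal(size=(2, D)), wifi_coords)

# LiDAR: 128 points with explicit (x, y, z), 4 features per point.
lidar_raw = rng.normal(size=(128, 4))
lidar_coords = rng.uniform(0, 5, size=(128, 3))
lidar_tokens = tokenize(lidar_raw, rng.normal(size=(4, D)), lidar_coords)

# Both modalities now share the same token format: (N_tokens, D).
print(wifi_tokens.shape, lidar_tokens.shape)
```

The key point the sketch illustrates is that once every modality, spatial or not, is assigned pseudo-3D coordinates, a single positional-embedding scheme and a single attention backbone can consume all of them.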
Moreover, adopting such a modality-agnostic representation allows multi-modal fusion via simple token concatenation, which improves performance without architectural modifications.
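Under that shared format, fusion really can be a single concatenation along the sequence axis; a brief sketch (token counts are the hypothetical ones from above):

```python
import numpy as np

# Tokens from two modalities that share the embedding dimension D = 32.
wifi_tokens = np.zeros((90, 32))    # placeholder WiFi token sequence
lidar_tokens = np.zeros((128, 32))  # placeholder LiDAR token sequence

# Multi-modal fusion: concatenate along the sequence (token) axis,
# so the same attention layers operate across both modalities.
fused = np.concatenate([wifi_tokens, lidar_tokens], axis=0)
print(fused.shape)  # (218, 32)
```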
Extensive experiments demonstrate that UniversalPose achieves comparable or superior accuracy to modality-specific expert models while supporting multiple modalities through joint training. Moreover, with synchronized multi-modal inputs, the same architecture outperforms the existing state-of-the-art fusion model. Our code will be publicly available.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10507