Unsupervised Pose Estimation by Means of an Innovative Vision Transformer

Published: 01 Jan 2022, Last Modified: 15 May 2025 · ICAISC (2) 2022 · CC BY-SA 4.0
Abstract: Attention-only Transformers [34] have been applied to both Natural Language Processing (NLP) and Computer Vision (CV) tasks. One Transformer architecture developed specifically for CV is the Vision Transformer (ViT) [15], and ViT models have since been used to solve numerous CV tasks. One interesting task is the pose estimation of a human subject. We present our modified ViT model, Un-TraPEs (UNsupervised TRAnsformer for Pose Estimation), which reconstructs a subject’s pose from a monocular image and its estimated depth. We compare the results obtained with this model against a ResNet [17] trained from scratch and a ViT finetuned to the task, and show promising results.
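The abstract describes a ViT-style model that regresses a pose from a monocular image together with an estimated depth map. The following is a minimal, hypothetical sketch of such an architecture (not the authors' Un-TraPEs code; all class names, dimensions, and the joint count are illustrative assumptions): a patch-embedding layer over a 4-channel RGB-D input, a small Transformer encoder, and a head that regresses 3D coordinates for a fixed set of body joints.

```python
# Hypothetical sketch of a ViT-based pose regressor over RGB-D input.
# Not the authors' Un-TraPEs implementation; names and sizes are assumptions.
import torch
import torch.nn as nn

class PoseViT(nn.Module):
    def __init__(self, img_size=64, patch=8, dim=128, depth=4, heads=4, joints=17):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Patch embedding over the RGB-D input (3 color channels + 1 depth channel).
        self.embed = nn.Conv2d(4, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Regress (x, y, z) per joint from the mean-pooled token features.
        self.head = nn.Linear(dim, joints * 3)
        self.joints = joints

    def forward(self, rgbd):                                   # rgbd: (B, 4, H, W)
        tokens = self.embed(rgbd).flatten(2).transpose(1, 2)   # (B, N, dim)
        feats = self.encoder(tokens + self.pos).mean(dim=1)    # (B, dim)
        return self.head(feats).view(-1, self.joints, 3)       # (B, joints, 3)

model = PoseViT()
out = model(torch.randn(2, 4, 64, 64))  # batch of 2 RGB-D crops
print(out.shape)  # torch.Size([2, 17, 3])
```

In an unsupervised setting such a regressor would be trained without pose labels, e.g. via reconstruction or consistency objectives derived from the image and depth; the sketch above shows only the forward pass.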