A Novel Approach for Virtual Locomotion Gesture Classification: Self-Teaching Vision Transformer for a Carpet-Type Tactile Sensor

Published: 01 Jan 2024, Last Modified: 30 Sep 2024, VR 2024, CC BY-SA 4.0
Abstract: Locomotion gesture classification in virtual reality (VR) is the process of analyzing and identifying specific user movements in the real world to navigate virtual environments. However, existing methods often require wearable sensors, which have notable limitations. To address this, we utilize a high-resolution carpet-type tactile sensor as a foot action recognition interface, which was previously unexplored in the context of locomotion gesture classification. This interface captures the user's foot pressure data in detail, making it possible to distinguish similar actions. In this paper, to efficiently process the captured foot tactile data and classify nuanced actions, we adopt a Vision Transformer (ViT) architecture and propose a novel Self-Teaching Vision Transformer (STViT) model that integrates elements of the Shifted window Vision Transformer (SwinViT) and the Data-efficient image Transformer (DeiT). Unlike DeiT, our model uses itself from $N$ steps prior as the teacher model, which is continuously updated. By referencing its own knowledge from previous training stages, the model progressively refines its understanding of similar action gestures and improves its classification ability. We also adopt the base architecture of SwinViT to exploit patch merging, which improves the ability to differentiate between variations of similar actions by capturing information at different scales. We evaluated seven vision-based methods, demonstrating promising results. Our model outperformed ResNet by 19.6% and outperformed DeiT and SwinViT by 3.3% and 2.9%, respectively, achieving 92.7% accuracy. To validate our model's real-world applicability, we conducted user preference tests and in-game performance evaluations with 18 participants. The participants preferred our model over SwinViT and DeiT, corroborating the computational results. A video demonstrating the VR testing of STViT can be found at https://youtu.be/NJslvanRn18
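The abstract describes the self-teaching mechanism only at a high level: the student distills from a frozen snapshot of itself taken $N$ steps earlier, and the snapshot is refreshed as training proceeds. Below is a minimal PyTorch-style sketch of one plausible realization, assuming a standard soft-label distillation loss; the function names, the distillation weight `alpha`, the temperature `tau`, and the update interval `N` are illustrative assumptions, not values or formulations taken from the paper.

```python
import copy
import torch
import torch.nn.functional as F

def self_teaching_step(model, teacher, batch, labels, optimizer, alpha=0.5, tau=2.0):
    """One training step with a self-teaching (N-steps-prior) teacher.

    `model` is the student being trained; `teacher` is a frozen snapshot of
    the same network taken N optimizer steps earlier. `alpha` (distillation
    weight) and `tau` (temperature) are hypothetical hyperparameters.
    """
    model.train()
    with torch.no_grad():
        teacher_logits = teacher(batch)  # soft targets from the earlier self

    student_logits = model(batch)
    ce_loss = F.cross_entropy(student_logits, labels)
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * (tau ** 2)

    loss = (1 - alpha) * ce_loss + alpha * kd_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def maybe_update_teacher(model, teacher, step, N=100):
    """Every N steps, refresh the frozen teacher with the student's current weights."""
    if step % N == 0:
        teacher.load_state_dict(copy.deepcopy(model.state_dict()))
        for p in teacher.parameters():
            p.requires_grad_(False)
```

In this reading, the teacher lags the student by at most `N` optimizer steps, so the soft targets always come from a slightly older version of the same network rather than from a separately pretrained model as in DeiT.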