Abstract: Driver monitoring systems are becoming more common in modern cars, and they are crucial because autonomous vehicles depend on the driver's continued attention. The increasing application of deep learning techniques in in-car driver monitoring systems can be attributed to their success in estimating human body pose. In 3-D human pose estimation, recent transformer-based methods have demonstrated remarkable effectiveness. However, as the number of joints increases, the computational cost of generating the joint-to-joint affinity matrix grows quadratically. To address this issue, this research develops a pretrained spatial-temporal transformer (PST-Transformer) model. In the pretraining phase, a masking module randomly masks the joints, and an autoencoder is employed to reconstruct the corrupted 2-D poses. During training, a temporal downsampling strategy is proposed to reduce redundant data. To predict 3-D driving poses, an aggregator is paired with the fine-tuned pretrained encoder. The encoder in the PST-Transformer learns 2-D spatial-temporal relationships before extracting 3-D spatial and temporal features. To evaluate the proposed approach, a new driving posture dataset named human driving in vehicle (HDIV) is also created, which covers a variety of driving behaviors. Extensive experiments on HDIV and the widely used Human3.6M dataset show that our method outperforms state-of-the-art methods in both accuracy and computational complexity.
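The random joint masking used in the pretraining phase can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the function name `mask_joints`, the mask ratio, and the zero-fill convention are all assumptions made for illustration; the paper's masking module may differ in how masked joints are encoded.

```python
import numpy as np

def mask_joints(poses, mask_ratio=0.3, rng=None):
    """Randomly mask (zero out) a fraction of joints in a 2-D pose sequence.

    poses: array of shape (T, J, 2) -- T frames, J joints, 2-D coordinates.
    Returns the corrupted poses and a boolean mask (True = joint was masked).
    An autoencoder would then be trained to reconstruct `poses` from `masked`.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, J, _ = poses.shape
    # Draw an independent masking decision per joint per frame.
    mask = rng.random((T, J)) < mask_ratio
    masked = poses.copy()
    masked[mask] = 0.0  # assumed convention: masked joints are zero-filled
    return masked, mask
```

Under this sketch, the unmasked joints are left untouched, so the reconstruction loss can be computed either over all joints or only over the masked ones, depending on the training objective chosen.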