Transformer-Based Full-Body Pose Estimation for Rehabilitation via RGB Camera and IMU Fusion

Published: 19 Aug 2025, Last Modified: 24 Sept 2025 | BSN 2025 | Everyone | CC BY 4.0
Keywords: human pose estimation, rehabilitation exercises, multimodal sensing
TL;DR: The paper proposes a temporal transformer-based method that fuses images and IMU data to improve 3D human pose estimation for specialized rehabilitation exercises.
Abstract: Rehabilitation training plays a vital role in the recovery of lower back and cervical spine function. Human pose estimation can support this process by guiding and evaluating rehabilitation movements. However, specialized rehabilitation exercises often involve severe self-occlusions, posing significant challenges for vision-based pose estimation methods. We thus propose a full-body pose estimation framework tailored for rehabilitation exercises, which fuses monocular images and inertial measurement unit (IMU) signals using a temporal transformer. Multimodal data was collected from six subjects performing 22 specialized rehabilitation movements (e.g., single-leg open book, cross-leg body rotation, standing iliotibial band stretch, standing lumbar extension). The collected data comprises synchronized images, 2D and 3D human keypoint coordinates, and IMU signals. Our approach first employs a convolutional neural network (CNN) to extract 2D keypoints from image sequences. These keypoints, combined with IMU signals, are then processed by a temporal transformer to estimate 3D joint coordinates. On the collected data, a vision-only baseline yields a 2D joint position error of 7.33 $\pm$ 2.08 pixels and a 3D joint error of 10.05 $\pm$ 2.67 cm. In comparison, the proposed approach achieves lower errors, with 5.50 $\pm$ 0.75 pixels for 2D joints and 8.27 $\pm$ 1.03 cm for 3D joints. By leveraging inertial data, our approach enhances the robustness of pose estimation under challenging conditions such as self-occlusion, demonstrating its potential for both clinical and home-based rehabilitation applications.
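The abstract reports 2D errors in pixels and 3D errors in centimeters but does not spell out how they are computed. The sketch below assumes the common definition: the Euclidean distance between each predicted joint and its reference, averaged over all joints and frames. The function name `mean_joint_error` and the nested-list data layout are illustrative choices, not taken from the paper.

```python
import math


def mean_joint_error(pred, truth):
    """Mean Euclidean distance between predicted and reference joints.

    pred, truth: list of frames; each frame is a list of joints; each joint
    is a tuple of coordinates (2 values for pixel error, 3 for cm error).
    Assumed metric definition; the paper may average differently.
    """
    if len(pred) != len(truth):
        raise ValueError("frame counts differ")
    total, count = 0.0, 0
    for frame_p, frame_t in zip(pred, truth):
        if len(frame_p) != len(frame_t):
            raise ValueError("joint counts differ")
        for joint_p, joint_t in zip(frame_p, frame_t):
            total += math.dist(joint_p, joint_t)  # per-joint distance
            count += 1
    if count == 0:
        raise ValueError("no joints to evaluate")
    return total / count


# Toy example: one frame, two 3D joints (units: cm).
pred = [[(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]]
truth = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
print(mean_joint_error(pred, truth))  # 2.5
```

The same function covers both reported quantities: pass 2D tuples for a pixel error and 3D tuples for a centimeter error. The $\pm$ values in the abstract would then be a spread of this quantity, for example across subjects or movements, which the abstract leaves unspecified.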
Track: 3. Signal processing, machine learning, deep learning, and decision-support algorithms for digital and computational health
NominateReviewer: Name: Yuanshuo Tan, Email: tanyuanshuo@sjtu.edu.cn
Submission Number: 9