Jointly Harnessing Prior Structures and Temporal Consistency for Sign Language Video Generation

Yucheng Suo, Zhedong Zheng, Xiaohan Wang, Bang Zhang, Yi Yang

Published: 2024, Last Modified: 21 Jan 2026ACM Trans. Multim. Comput. Commun. Appl. 2024EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Sign language provides a way for differently-abled individuals to express their feelings and emotions. However, learning sign language can be challenging and time consuming. An alternative approach is to animate user photos using sign language videos of specific words, which can be achieved using existing image animation methods. However, the finger motions in the generated videos are often not ideal. To address this issue, we propose the Structure-aware Temporal Consistency Network (STCNet), which jointly optimizes the prior structure of humans with temporal consistency to produce sign language videos. We use a fine-grained skeleton detector to acquire knowledge of body structure and introduce both short- and long-term cycle loss to ensure the continuity of the generated video. The two losses and keypoint detector network are optimized in an end-to-end manner. Quantitative and qualitative evaluations on three widely used datasets, namely LSA64, Phoenix-2014T, and WLASL-2000, demonstrate the effectiveness of the proposed method. It is our hope that this work can contribute to future studies on sign language production.