SignAligner: Harmonizing Complementary Pose Modalities for Coherent Sign Language Generation

03 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: sign language generation, sign language production
Abstract: Sign language generation faces the challenge of producing natural and expressive results due to the complexity of sign language, which involves hand gestures, facial expressions, and body movements. In this work, we propose a novel method called SignAligner for realistic sign language generation. The framework consists of three stages: text-driven multimodal co-generation, online collaborative correction, and realistic video synthesis. First, a joint generator incorporating a Transformer-based text encoder and cross-modal attention simultaneously produces posture, gesture, and body movements from text. Next, an online correction module refines the generated modalities using dynamic loss weighting and cross-modal attention to resolve spatiotemporal conflicts and enhance semantic consistency. Finally, the corrected poses are fed into a pre-trained video generation network to synthesize high-fidelity sign language videos. Additionally, we introduce a dataset extension scheme that derives three new landmark representations (i.e., Pose, HaMeR, and SMPLer-X) via pre-trained models, validated on PHOENIX14T and CSL-Daily. Extensive experiments show that SignAligner significantly improves the accuracy and expressiveness of generated sign videos.
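The abstract mentions that the online correction stage combines per-modality losses with dynamic weights. The paper's exact formulation is not given here; the following is a minimal hypothetical sketch of one plausible scheme (a softmax over current modality losses, so correction emphasizes the weakest stream), intended only to illustrate the idea of dynamic loss weighting:

```python
import math

def dynamic_loss_weights(losses, temperature=1.0):
    """Softmax-style weights over per-modality losses (e.g. pose, hand, body).

    Higher-loss modalities receive larger weights, focusing the
    correction step on the currently weakest stream.
    NOTE: this weighting rule is an assumption for illustration,
    not SignAligner's actual formulation.
    """
    exps = {k: math.exp(v / temperature) for k, v in losses.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

def total_loss(losses, weights):
    # Weighted sum of modality losses, optimized jointly.
    return sum(weights[k] * losses[k] for k in losses)

# Hypothetical per-modality losses at some training step:
losses = {"pose": 0.8, "hand": 0.3, "body": 0.5}
w = dynamic_loss_weights(losses)
# The worst-performing modality ("pose") gets the largest weight.
```

The weights sum to one, so the total loss stays on a comparable scale as the per-modality balance shifts during training.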
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 1706