Abstract: Sign language is a visual language that encompasses all
linguistic features of natural languages and serves as the
primary communication method for the deaf and hard-of-hearing communities. Although many studies have successfully adapted pretrained language models (LMs) for sign
language translation (sign-to-text), the reverse task of sign
language generation (SLG, text-to-sign) remains largely unexplored. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs using
a pretrained LM. To align sign language with the LM, we
leverage a decoupled tokenizer that discretizes continuous
signs into token sequences representing various body parts.
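To make the decoupled tokenizer concrete, here is a minimal sketch that quantizes per-part motion features against independent codebooks via a VQ-style nearest-neighbor lookup; the part names, feature dimensions, and codebook sizes are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class PartVQTokenizer(nn.Module):
    """Minimal VQ-style tokenizer sketch for one body part.

    Hypothetical: the abstract does not specify codebook sizes,
    feature dimensions, or the exact quantization scheme.
    """
    def __init__(self, feat_dim=256, codebook_size=1024):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, feat_dim)

    def forward(self, feats):  # feats: (T, feat_dim) continuous pose features
        # Nearest-neighbor lookup: each frame feature -> one discrete token id.
        dists = torch.cdist(feats, self.codebook.weight)  # (T, codebook_size)
        return dists.argmin(dim=-1)                       # (T,) token ids

# Decoupled tokenization: one independent codebook per body part,
# so each part yields its own token stream (part names are assumptions).
parts = ["body", "left_hand", "right_hand", "face"]
tokenizers = {p: PartVQTokenizer() for p in parts}
motion = {p: torch.randn(60, 256) for p in parts}   # 60 frames of dummy features
tokens = {p: tokenizers[p](motion[p]) for p in parts}
```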
During decoding, unlike existing approaches that flatten all
part-wise tokens into a single sequence and predict one token at a time, we propose a multi-head decoding method
capable of predicting multiple tokens simultaneously. This
approach improves inference efficiency while maintaining
effective information fusion across different body parts.
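A minimal sketch of the multi-head decoding idea follows, assuming one linear output head per body part over a shared LM hidden state; the hidden size, vocabulary size, and part list are hypothetical, and the cross-part fusion mechanism is not detailed in the abstract.

```python
import torch
import torch.nn as nn

class MultiHeadDecoder(nn.Module):
    """Sketch of multi-head decoding: one output head per body part,
    all predicting in parallel from the same LM hidden state.
    Sizes and the part list are illustrative assumptions.
    """
    def __init__(self, hidden=768, vocab=1024,
                 parts=("body", "left_hand", "right_hand", "face")):
        super().__init__()
        self.heads = nn.ModuleDict({p: nn.Linear(hidden, vocab) for p in parts})

    def forward(self, h):  # h: (B, hidden) last-layer LM state at one step
        # A single forward pass yields one token per part, instead of
        # len(parts) sequential steps over a flattened single stream.
        return {p: head(h).argmax(dim=-1) for p, head in self.heads.items()}

decoder = MultiHeadDecoder()
h = torch.randn(2, 768)    # dummy hidden states for a batch of 2
next_tokens = decoder(h)   # dict: part -> (B,) token ids, decoded simultaneously
```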
To further ease the generation process, we propose a retrieval-enhanced SLG approach, which incorporates external sign
dictionaries to provide accurate word-level signs as auxiliary conditions, significantly improving the precision of
generated signs. Extensive qualitative and quantitative
evaluations demonstrate the effectiveness of SOKE.
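To illustrate the retrieval-enhanced conditioning described above, the following sketch looks up word-level sign entries in an external dictionary and supplies them as an auxiliary condition alongside the text prompt; the dictionary format, token ids, and function names are all hypothetical.

```python
# Hypothetical sketch: word-level sign entries from an external dictionary
# are retrieved for the input text and passed as auxiliary context.
# The generator (not shown) would attend to them together with the prompt.

SIGN_DICTIONARY = {
    "hello": [17, 842, 93],    # pre-tokenized word-level sign (dummy ids)
    "thank": [255, 11, 604],
}

def retrieve_sign_tokens(text: str) -> list[int]:
    """Collect dictionary tokens for every word with a known sign."""
    tokens: list[int] = []
    for word in text.lower().split():
        tokens.extend(SIGN_DICTIONARY.get(word, []))
    return tokens

def build_condition(text: str) -> dict:
    # The retrieved tokens act as an auxiliary condition for generation.
    return {"text": text, "aux_sign_tokens": retrieve_sign_tokens(text)}

print(build_condition("Hello thank you"))
```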