Abstract: End-to-end sign language generation models do not accurately represent the prosody of sign languages. The resulting lack of temporal and spatial variation in generated signs leads to poor quality and lower perceived naturalness. In this paper, we seek to improve prosody in generated sign languages by modeling intensification in a data-driven manner, using strategies grounded in the linguistics of sign language that enhance the representation of intensifiers in gloss annotations. To apply these strategies, we first annotate a subset of the benchmark PHOENIX14T dataset with different levels of intensification. We then use a supervised intensity tagger to extend the tagging to the whole dataset. This enhanced dataset is then used to train state-of-the-art transformer models for sign language generation. We find that our intensifier modeling yields better results under automated metrics, and human evaluation indicates a significantly higher preference for the videos generated using our strategies in the presence of intensity modifiers.
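To make the tagging step concrete, here is a minimal sketch of a supervised per-gloss intensity tagger. This is not the paper's actual model or data: the gloss sequences, label set (NONE/LOW/HIGH), and the simple perceptron classifier are all illustrative assumptions standing in for the tagger trained on the annotated PHOENIX14T subset. It shows the idea of learning intensity labels from an annotated subset and then rewriting glosses (e.g. REGEN → REGEN^HIGH) across the rest of the data.

```python
from collections import defaultdict

# Toy annotated gloss sequences (hypothetical stand-ins for the
# intensity-annotated PHOENIX14T subset described in the paper).
TRAIN = [
    (["MORGEN", "REGEN", "VIEL"], ["NONE", "HIGH", "NONE"]),
    (["HEUTE", "REGEN"], ["NONE", "NONE"]),
    (["MORGEN", "WIND", "STARK"], ["NONE", "HIGH", "NONE"]),
    (["HEUTE", "WIND"], ["NONE", "NONE"]),
]

def features(glosses, i):
    """Current gloss plus left/right context features for position i."""
    return {
        f"cur={glosses[i]}",
        f"prev={glosses[i - 1] if i > 0 else '<s>'}",
        f"next={glosses[i + 1] if i + 1 < len(glosses) else '</s>'}",
    }

def train_tagger(data, labels=("NONE", "LOW", "HIGH"), epochs=10):
    """Multiclass perceptron: weights[label][feature] -> float."""
    w = {y: defaultdict(float) for y in labels}
    for _ in range(epochs):
        for glosses, gold in data:
            for i, y in enumerate(gold):
                feats = features(glosses, i)
                pred = max(w, key=lambda c: sum(w[c][f] for f in feats))
                if pred != y:  # standard perceptron update on mistakes
                    for f in feats:
                        w[y][f] += 1.0
                        w[pred][f] -= 1.0
    return w

def tag(w, glosses):
    """Append the predicted intensity level to each gloss, e.g. REGEN^HIGH."""
    out = []
    for i, g in enumerate(glosses):
        feats = features(glosses, i)
        y = max(w, key=lambda c: sum(w[c][f] for f in feats))
        out.append(g if y == "NONE" else f"{g}^{y}")
    return out

w = train_tagger(TRAIN)
print(tag(w, ["MORGEN", "REGEN", "VIEL"]))  # REGEN is tagged HIGH here
```

The enhanced glosses produced this way would then replace the plain glosses when training the downstream generation transformer, so the generator can condition on intensity.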