Visual context learning based on cross-modal knowledge for continuous sign language recognition

Published: 01 Jan 2025 · Last Modified: 15 May 2025 · Vis. Comput. 2025 · CC BY-SA 4.0
Abstract: Continuous sign language recognition (CSLR) aims to recognize glosses in sign language videos. Unfortunately, existing methods focus on improving visual feature extraction while ignoring the linguistic knowledge implicitly contained in gloss (symbolic label) sequences. In this paper, we propose a visual context learning network based on cross-modal knowledge for CSLR that exploits the semantic information contained in glosses to bridge the semantic gap between sign language videos and glosses. A dedicated cross-attention mechanism is introduced into the framework to enable effective cross-modal knowledge exchange. Specifically, we construct two auxiliary tasks: spatiotemporal semantic modeling (SSM) and masked language modeling (MLM). SSM enhances the video representation through a joint cross-attention encoder with semantic supervision, while MLM learns fine-grained contextual interactions between video and text. Experimental results on three public CSLR datasets (PHOENIX-2014, PHOENIX-2014-T, and CSL-Daily) demonstrate the effectiveness of the proposed method, which reduces the error rate by 2.0% and 1.9% over the previous state of the art on the two evaluation sets of CSL-Daily, respectively. Code is publicly available at https://github.com/Liuklin/VCL.
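The core idea of the cross-modal exchange can be illustrated with a minimal sketch of cross-attention, in which video-frame features act as queries against gloss-embedding keys and values. This is a generic, hypothetical NumPy illustration of the mechanism, not the paper's implementation: the random projection matrices stand in for learned parameters, and all dimensions are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(video_feats, gloss_embeds, d_k=16, seed=0):
    """Video frames (queries) attend to gloss embeddings (keys/values).

    Hypothetical sketch: random projections stand in for the learned
    weights of a joint cross-attention encoder.
    """
    rng = np.random.default_rng(seed)
    d_v, d_g = video_feats.shape[1], gloss_embeds.shape[1]
    Wq = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)
    Wk = rng.standard_normal((d_g, d_k)) / np.sqrt(d_g)
    Wv = rng.standard_normal((d_g, d_k)) / np.sqrt(d_g)
    Q, K, V = video_feats @ Wq, gloss_embeds @ Wk, gloss_embeds @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (T_video, T_gloss) weights
    return attn @ V                         # gloss-conditioned video features

# 8 video-frame features of dim 32 attending to 5 gloss embeddings of dim 24
fused = cross_attention(np.ones((8, 32)), np.ones((5, 24)))
print(fused.shape)  # (8, 16)
```

Each row of the attention matrix is a distribution over glosses, so every output frame feature is a semantically weighted mixture of gloss information, which is the kind of video-text fusion the auxiliary tasks supervise.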