No blind alignment but generation: A different view of continuous sign language recognition based on diffusion
Abstract: Continuous sign language recognition (CSLR) tasks typically rely on cross-modal alignment. However, due to the weakly supervised nature of CSLR, such alignment often fails to match sign frames with glosses accurately. In this paper, we present a novel perspective on CSLR by treating it as a video-to-text generation task. Utilizing a diffusion model, we bypass the error-prone weakly supervised alignment. By fully exploiting the cross-modal feature correspondence capability of the diffusion model and capitalizing on the semantic consistency between sign videos and textual glosses, we obtain gloss features directly. Furthermore, we introduce a contrastive learning-based strategy that enhances feature representation during the denoising phase and guides the denoising process of the diffusion model, offering a more sophisticated alternative to the simplistic attention mechanisms traditionally employed. Comprehensive experiments on four public CSLR datasets demonstrate the effectiveness of our generative approach, which outperforms state-of-the-art methods.
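To make the contrastive guidance idea concrete, below is a minimal sketch of how a contrastive objective could steer a conditioned denoising step, assuming pooled per-sequence features of matching dimensionality. The function names (`denoise_step`, `contrastive_guidance_loss`) and the symmetric InfoNCE formulation are illustrative assumptions, not the paper's published implementation.

```python
import torch
import torch.nn.functional as F

def denoise_step(denoiser, z_t, t, video_feats):
    """One reverse-diffusion step (hypothetical interface): predict clean
    gloss features from the noisy latent z_t at timestep t, conditioned
    on video features. `denoiser` is any network taking
    (latent, timestep, condition) and returning a tensor like z_t."""
    return denoiser(z_t, t, video_feats)

def contrastive_guidance_loss(gloss_feats, video_feats, temperature=0.07):
    """InfoNCE-style loss (assumed form): pull each denoised gloss feature
    toward its paired video feature and push apart other pairs in the
    batch. Both inputs have shape (batch, dim)."""
    g = F.normalize(gloss_feats, dim=-1)
    v = F.normalize(video_feats, dim=-1)
    logits = g @ v.t() / temperature                      # (batch, batch) similarities
    targets = torch.arange(g.size(0), device=g.device)    # diagonal = positive pairs
    # Symmetric cross-entropy over gloss->video and video->gloss directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage sketch: combine the diffusion reconstruction loss with the
# contrastive term so semantic consistency guides denoising.
# pred = denoise_step(denoiser, z_t, t, video_feats)
# loss = F.mse_loss(pred, clean_gloss_feats) + \
#        contrastive_guidance_loss(pred.mean(dim=1), video_feats.mean(dim=1))
```

The key design choice this illustrates is that the contrastive term supervises the semantics of the denoised features rather than any frame-to-gloss alignment, which is how the generative formulation sidesteps the weakly supervised alignment problem.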