MixSignGraph: A Sign Sequence is Worth Mixed Graphs of Nodes

Shiwei Gan; Yafeng Yin; Zhiwei Jiang; Lei Xie; Sanglu Lu; Hongkai Wen

Published: 18 Sept 2025, Last Modified: 29 Oct 2025. Venue: NeurIPS 2025 poster. License: CC BY 4.0

Keywords: Sign Language Recognition and Translation, Graph convolutional network

Abstract: Recent advances in sign language research have benefited from CNN-based backbones, which are primarily transferred from traditional computer vision tasks (e.g., object detection, image recognition). While these CNN-based backbones usually excel at extracting features such as contours and texture, they may struggle to capture sign-related features. To capture such sign-related features, the SignGraph model extracts cross-region sign features by building the Local Sign Graph (LSG) module and the Temporal Sign Graph (TSG) module. However, we emphasize that although capturing cross-region dependencies can improve sign language performance, it may degrade the representation quality of local regions. To mitigate this, we introduce MixSignGraph, which represents sign sequences as a group of mixed graphs for feature extraction. Specifically, besides the LSG and TSG modules, which model intra-frame and inter-frame cross-region features, we design a simple yet effective Hierarchical Sign Graph (HSG) module. It enhances local region representations after cross-region feature extraction by aggregating same-region features from feature maps of different granularity within a frame, i.e., it boosts discriminative local features. In addition, to further improve performance on gloss-free sign language tasks, we propose a simple yet counter-intuitive Text-based CTC Pre-training (TCTC) method, which generates pseudo gloss labels from text sequences for model pre-training. Extensive experiments on five current sign language datasets demonstrate that MixSignGraph surpasses the most current models on multiple sign language tasks across several datasets, without relying on any additional cues. Code and models are available at: https://github.com/gswycf/SignLanguage.
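
To make the TCTC idea more concrete, the following is a minimal sketch of how pseudo gloss labels might be derived from text sequences and turned into CTC target ids. The abstract does not specify the actual procedure, so everything here is an assumption made for illustration: the lowercasing, punctuation stripping, and stop-word filtering, the `STOP_WORDS` list, and the helper names `pseudo_glosses` and `build_vocab` are all hypothetical and are not taken from the authors' code.

```python
import string

# Hypothetical stop-word list, chosen only for illustration.
STOP_WORDS = {"the", "a", "an", "is", "are", "will", "be", "to", "of", "in", "and"}


def pseudo_glosses(sentence):
    """Turn a text sentence into an ordered list of pseudo gloss tokens.

    Assumed recipe (not from the paper): lowercase, strip punctuation,
    drop stop words, and upper-case what remains, keeping word order.
    """
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return [word.upper() for word in cleaned.split() if word not in STOP_WORDS]


def build_vocab(sentences):
    """Map each pseudo gloss to an integer id; id 0 is reserved for the CTC blank."""
    vocab = {"<blank>": 0}
    for sentence in sentences:
        for gloss in pseudo_glosses(sentence):
            vocab.setdefault(gloss, len(vocab))
    return vocab


# Toy text sequences standing in for the translation targets of a dataset.
texts = ["Tomorrow the weather will be sunny.", "Rain is expected in the north."]
vocab = build_vocab(texts)

# Integer target sequences that a CTC loss could consume during pre-training.
targets = [[vocab[g] for g in pseudo_glosses(t)] for t in texts]
```

In a setup like this, the `targets` lists would play the role that annotated gloss sequences play in gloss-based training. Note that CTC assumes a monotonic alignment between input frames and target labels, whereas the word order of a text sequence need not match the order of the signs.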

Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)

Submission Number: 11650
