SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Poster · License: CC BY 4.0
Abstract: Sign language retrieval, an emerging vision-language task, has received widespread attention. Unlike traditional video retrieval, it places greater emphasis on understanding the semantics of the human actions contained in video clips. Previous works typically encode only RGB videos to obtain high-level semantic features, so local action details are drowned in a large amount of redundant visual information. Furthermore, existing RGB-based sign retrieval methods suffer from the huge memory cost of embedding dense visual data during end-to-end training and therefore resort to a frozen offline RGB encoder, which leads to suboptimal feature representations. To address these issues, we propose a novel sign language representation framework, the Semantically Enhanced Dual-Stream Encoder (SEDS), which integrates the Pose and RGB modalities to represent both the local and the global information of sign language videos. Specifically, the Pose encoder embeds the coordinates of keypoints corresponding to human joints, effectively capturing detailed action features. For better context-aware fusion of the two video modalities, we propose a Cross Gloss Attention Fusion (CGAF) module that aggregates adjacent clip features with similar semantics both within and across modalities. Moreover, a Pose-RGB Fine-grained Matching Objective is developed to enhance the aggregated fusion features by contextually matching the fine-grained dual-stream features. Apart from the offline RGB encoder, the whole framework contains only lightweight learnable networks and can be trained end to end. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods on the How2Sign, PHOENIX-2014T, and CSL-Daily datasets.
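To make the dual-stream fusion idea concrete, the following is a minimal, illustrative PyTorch sketch of aggregating clip features with intra- and inter-modality attention; the module name CGAFBlock, the feature dimension, and the residual/projection choices are our assumptions for exposition, not the paper's exact CGAF implementation.

```python
# Illustrative sketch only: a simplified dual-stream fusion step inspired by the
# paper's description. CGAFBlock, d_model=512, and the residual/projection
# details are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class CGAFBlock(nn.Module):
    """Fuse Pose and RGB clip features via intra- and inter-modality attention."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.intra_pose = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.intra_rgb = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_p2r = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_r2p = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, pose: torch.Tensor, rgb: torch.Tensor) -> torch.Tensor:
        # pose, rgb: (batch, num_clips, d_model) clip-level features
        pose_intra, _ = self.intra_pose(pose, pose, pose)  # aggregate semantically similar pose clips
        rgb_intra, _ = self.intra_rgb(rgb, rgb, rgb)       # aggregate semantically similar RGB clips
        pose_cross, _ = self.cross_p2r(pose, rgb, rgb)     # pose queries RGB context
        rgb_cross, _ = self.cross_r2p(rgb, pose, pose)     # RGB queries pose context
        pose_fused = self.norm(pose + pose_intra + pose_cross)
        rgb_fused = self.norm(rgb + rgb_intra + rgb_cross)
        return self.proj(torch.cat([pose_fused, rgb_fused], dim=-1))

# Toy usage: 2 videos, 16 clips each, 512-d features per clip.
fusion = CGAFBlock()
fused = fusion(torch.randn(2, 16, 512), torch.randn(2, 16, 512))  # -> (2, 16, 512)
```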
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Engagement] Multimedia Search and Recommendation
Relevance To Conference: In this work, we focus on a new task in the vision-language field: sign language retrieval. This task requires understanding the human actions in video clips while matching each video with its corresponding text; it therefore draws on both text-video retrieval and action recognition and is naturally a multimodal task. Additionally, we extract features from sign language videos in two modalities, Pose and RGB. To fuse them, we design a mechanism that leverages the characteristics of sign language videos to aggregate local features with similar semantic information both across and within modalities. Finally, we align the Text modality with the three video modalities of Pose, RGB, and Fusion. For better fusion performance, we also develop a supervised Pose-RGB fine-grained matching objective that matches the contextual fine-grained dual-stream features and implicitly aligns the fine-grained Pose-Text and RGB-Text similarity matrices. Overall, our work involves multimodal alignment, multimodal feature fusion, and multimodal coarse- and fine-grained matching, which is closely related to the multimodal theme of ACM MM 2024.
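As an illustration of clip-level matching between the two video streams, below is a hedged sketch of a symmetric InfoNCE-style objective over a Pose-RGB clip similarity matrix; the loss form, temperature value, and function name are assumptions chosen for exposition, not the paper's exact Pose-RGB fine-grained matching objective.

```python
# Hedged sketch of a clip-level matching objective between the two streams.
# The symmetric cross-entropy form and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def fine_grained_matching_loss(pose: torch.Tensor, rgb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """pose, rgb: (num_clips, d) features of one video; clips at the same index
    are treated as positive pairs, all other clips as negatives."""
    pose = F.normalize(pose, dim=-1)
    rgb = F.normalize(rgb, dim=-1)
    sim = pose @ rgb.t() / temperature            # (num_clips, num_clips) similarity matrix
    targets = torch.arange(sim.size(0))
    # Symmetric objective: Pose-to-RGB and RGB-to-Pose directions.
    return 0.5 * (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets))

# Toy usage: 16 clips with 512-d features per stream.
loss = fine_grained_matching_loss(torch.randn(16, 512), torch.randn(16, 512))
```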
Supplementary Material: zip
Submission Number: 353