LexSign: Learning Sign Language from Lexical Descriptions

ICLR 2026 Conference Submission 7787 Authors

16 Sept 2025 (modified: 23 Dec 2025)
License: CC BY 4.0
Keywords: Sign Language Recognition, Non-verbal Communication, Zero-shot Learning
TL;DR: This paper focuses on collecting, generating, and leveraging lexical descriptions for advancing Sign Language Understanding (SLU) by capturing the sub-unit structure of sign language.
Abstract: Sign languages are well-defined natural languages that convey meaning through both manual postures and non-manual expressions. While recent methods effectively transcribe sign language videos into compact textual tokens, they often overlook the intrinsic subunit-level structure of sign language. In this work, we explore leveraging the hierarchical structure within lexical descriptions to enhance fine-grained sign language understanding. Specifically, we first construct LexSign, a large-scale dataset comprising both manually curated and automatically generated lexical descriptions of signs. To ensure the quality of the generated descriptions, we build LexSign-Bench, a benchmark that comprehensively evaluates the sign language understanding capability of Multi-modal Large Language Models (MLLMs), and we further propose a perceive-then-summarize pipeline that leverages large foundation models to generate high-quality lexical descriptions. Building on LexSign, we propose Hierarchical Action-Language Interaction (HALI), which performs hierarchical alignment between lexical descriptions and sign language videos to obtain more distinguishable and generalizable visual representations. Experimental results on public datasets demonstrate that combining the collected lexical descriptions with the proposed HALI significantly improves performance across a range of sign language understanding tasks.
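To make the idea of aligning lexical descriptions with sign videos at multiple granularities concrete, the sketch below shows a hypothetical two-level contrastive alignment (sign level and subunit level). This is not the authors' HALI implementation; the module names, feature dimensions, and the CLIP-style InfoNCE objective are illustrative assumptions only.

```python
# Minimal sketch (assumed, not the authors' HALI): contrastive alignment between
# sign-video features and lexical-description features at two granularities.
import torch
import torch.nn as nn
import torch.nn.functional as F


def info_nce(video_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired video/text embeddings."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


class HierarchicalAlignment(nn.Module):
    """Aligns video and text at two levels: whole-sign and subunit (illustrative)."""

    def __init__(self, video_dim: int = 512, text_dim: int = 768, shared_dim: int = 256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, video_feats, text_feats, subunit_video_feats, subunit_text_feats):
        # Sign-level alignment: one embedding per video / per full lexical description.
        sign_loss = info_nce(self.video_proj(video_feats), self.text_proj(text_feats))
        # Subunit-level alignment: embeddings of video segments and description phrases,
        # flattened so that matching segment-phrase pairs are pulled together.
        sub_loss = info_nce(
            self.video_proj(subunit_video_feats.flatten(0, 1)),
            self.text_proj(subunit_text_feats.flatten(0, 1)),
        )
        return sign_loss + sub_loss


if __name__ == "__main__":
    B, S = 4, 3  # batch size, subunits per sign (illustrative values)
    model = HierarchicalAlignment()
    loss = model(
        torch.randn(B, 512), torch.randn(B, 768),
        torch.randn(B, S, 512), torch.randn(B, S, 768),
    )
    print(loss.item())
```

In this sketch the two losses share projection heads, so the sign-level and subunit-level objectives shape a single joint embedding space; whether HALI shares or separates these heads is not specified by the abstract.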
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7787