Where Motion Matters: Conditioning the Pre-Contextualization Interface for CTC-Based Sequence Learning

ACL ARR 2026 January Submission 9847 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: continuous sign language recognition, CSLR, connectionist temporal classification, sequence modeling, video processing, motion cues, representation learning, robustness, evaluation methodologies, transfer learning / domain adaptation, error analysis
Abstract: In CTC-based sequence recognition, representations must transition from locally encoded frame features to globally contextualized sequences, yet the interface where this transformation occurs remains underexplored. We show that this pre-contextualization interface is a critical bottleneck, and we study it in Continuous Sign Language Recognition (CSLR), where conditioning the feature transformation on motion cues, rather than simply adding motion features, reduces alignment errors. We propose MoRE (Motion-conditioned Representation Enhancement), a lightweight module that uses motion-derived gates to interpolate between two learned projections of visual features before sequence modeling. Controlled ablations on PHOENIX-2014 isolate three key findings: (1) placement at the pre-contextualization interface is critical, as post-contextualization placement degrades performance below the baseline; (2) learned gating outperforms fixed alternatives; and (3) MoRE primarily reduces deletion errors, the dominant CTC failure mode. Together, these results show that where motion is applied matters more than how it is incorporated under CTC supervision. We observe consistent improvements on PHOENIX-2014 and mixed results on CSL-Daily, suggesting that dataset-specific factors influence effectiveness.
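To make the abstract's gating mechanism concrete, the following is a minimal PyTorch sketch of motion-gated interpolation between two learned projections, placed before the sequence model. It is not the authors' released code: the class and parameter names are hypothetical, and the use of frame differences as the motion cue is an assumption (the abstract only says "motion-derived gates").

```python
import torch
import torch.nn as nn


class MoRESketch(nn.Module):
    """Hypothetical sketch of Motion-conditioned Representation Enhancement.

    A motion-derived gate interpolates between two learned projections of
    per-frame visual features, applied BEFORE the sequence-level
    (contextualization) encoder that feeds the CTC head.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.proj_a = nn.Linear(dim, dim)  # first learned projection
        self.proj_b = nn.Linear(dim, dim)  # second learned projection
        self.gate = nn.Sequential(         # motion cue -> gate in (0, 1)
            nn.Linear(dim, dim),
            nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, dim) frame features from the visual encoder.
        # Approximate motion cues by temporal frame differences (assumption).
        motion = torch.zeros_like(feats)
        motion[:, 1:] = feats[:, 1:] - feats[:, :-1]
        g = self.gate(motion)
        # Per-channel interpolation between the two projections.
        return g * self.proj_a(feats) + (1.0 - g) * self.proj_b(feats)


# Usage: insert between the frame-level visual encoder and the sequence
# model (e.g., a BiLSTM or Transformer) that precedes the CTC head.
x = torch.randn(2, 16, 512)        # (batch, frames, feature dim)
enhanced = MoRESketch(512)(x)      # same shape, motion-conditioned
```

Under this reading, placing the module after the sequence model would gate features that are already globally contextualized, which is the post-contextualization variant the abstract reports as degrading performance.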
Paper Type: Long
Research Area: Multilinguality and Language Diversity
Research Area Keywords: image text matching, multimodality, video processing, cross-modal pretraining, cross-modal application, vision question answering
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Data analysis
Languages Studied: German Sign Language, Chinese Sign Language
Submission Number: 9847