Abstract: Sign language serves as a critical communication medium for the deaf community, yet existing single-view recognition systems are limited in interpreting complex three-dimensional manual movements from monocular video sequences. Although multi-view analysis holds potential for improved spatial understanding, current methods lack effective mechanisms for cross-view feature correlation and adaptive multi-stream fusion. To address these challenges, we propose the Cross-view and Multi-level Transformer (CMTformer), a novel framework for isolated sign language recognition that hierarchically models spatiotemporal dependencies across viewpoints. The architecture integrates transformer-based modules to simultaneously capture dense cross-view correlations and distill high-level semantic relationships through multi-scale feature abstraction. Complementing this methodological advancement, we establish the Multi-View Chinese Sign Language (MVCSL) dataset under real-world conditions, addressing the critical shortage of multi-view benchmarking resources. Experimental evaluations demonstrate that CMTformer significantly outperforms conventional approaches in recognition robustness, particularly in processing intricate gesture dynamics through coordinated multi-view analysis. This study advances sign language recognition via interpretable cross-view modeling while providing an essential dataset for developing viewpoint-agnostic gesture understanding systems.
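To make the cross-view correlation idea described above concrete, below is a minimal, hypothetical PyTorch sketch of how two view-specific feature streams could exchange information through cross-attention before being fused. The module name `CrossViewFusionBlock`, the feature dimensions, and the two-view setup are illustrative assumptions for exposition only and do not reproduce the authors' CMTformer implementation.

```python
# Illustrative sketch only: a minimal cross-view attention fusion block.
# All names, dimensions, and the two-view setup are assumptions, not the
# CMTformer architecture itself.
import torch
import torch.nn as nn


class CrossViewFusionBlock(nn.Module):
    """Fuses per-frame features from two camera views via cross-attention."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Each view attends to the other view's features (cross-view correlation).
        self.attn_a_to_b = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_b_to_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)
        # Simple learned fusion of the two enriched streams.
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU())

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a, feats_b: (batch, time, dim) frame-level features per view.
        a_enriched, _ = self.attn_a_to_b(feats_a, feats_b, feats_b)
        b_enriched, _ = self.attn_b_to_a(feats_b, feats_a, feats_a)
        a_out = self.norm_a(feats_a + a_enriched)  # residual connection + norm
        b_out = self.norm_b(feats_b + b_enriched)
        # Concatenate both streams and project to a single fused representation.
        return self.fuse(torch.cat([a_out, b_out], dim=-1))


if __name__ == "__main__":
    block = CrossViewFusionBlock(dim=256, num_heads=4)
    view_a = torch.randn(2, 32, 256)  # 2 clips, 32 frames, 256-d features per view
    view_b = torch.randn(2, 32, 256)
    fused = block(view_a, view_b)
    print(fused.shape)  # torch.Size([2, 32, 256])
```

In a full multi-level design, blocks of this kind would typically be stacked at several feature scales so that both dense frame-level correlations and higher-level semantic relationships across viewpoints are captured, which is the general strategy the abstract attributes to CMTformer.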