X-TASAR: An Explainable Token-Selection Transformer Approach for Arabic Sign Language Alphabet Recognition

Published: 24 Nov 2025, Last Modified: 01 Dec 2025
Venue: 5th Muslims in ML Workshop, co-located with NeurIPS 2025
License: CC BY 4.0
Keywords: Vision Transformer, sign language detection, Arabic speakers, assistive technology.
TL;DR: An explainable method for accurate Arabic Sign Language alphabet recognition.
Abstract: We propose a multistage transformer-based architecture for efficient Arabic Sign Language (ArSL) recognition. The proposed approach first extracts a compact $7 \times 7$ grid of image features using a tiny Swin transformer. We then compute a class-conditioned score for each grid token using the [CLS] token as the query, and select a spatially diverse top-K subset via a grid non-maximum suppression (NMS) algorithm. Only these K selected tokens, together with [CLS], are passed to a small transformer-based classifier (ViT-Tiny) to produce the final label. In the visualizations, the colored heatmap indicates which regions of the image received the highest scores, and the dots mark the exact patches the classifier relied on for its decision. Our model achieves 98.1\% accuracy and 0.979 macro-F1 on the held-out test split of the RGB ArSL alphabet dataset (32 classes; 7,857 images collected from more than 200 signers). It is also computationally lighter than a ViT-Tiny baseline, since the classifier reads only K+1 tokens instead of all 196 patches. The proposed approach shows potential to be backbone-agnostic and can be adapted to other vision transformers with minimal modification, enabling accessible and scalable sign-language recognition tools for Arabic-speaking deaf and hard-of-hearing communities worldwide.
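To make the three stages of the abstract concrete, here is a minimal PyTorch sketch of the token-selection pipeline as described above: [CLS]-conditioned scoring of grid tokens, greedy top-K selection with grid NMS, and a small transformer classifier over the K+1 surviving tokens. This is not the authors' code; all module names, the embedding dimension D=192, K=8, and the suppression radius are illustrative assumptions, and random features stand in for the Swin backbone.

```python
# Minimal sketch of the X-TASAR-style token-selection pipeline.
# Assumptions: D=192, 7x7 grid, K=8, radius=1 are illustrative, not from the paper.
import math
import torch
import torch.nn as nn

def cls_token_scores(tokens, cls, proj_q, proj_k):
    """Score each grid token with the [CLS] embedding as the query."""
    q = proj_q(cls)                           # (D,)
    k = proj_k(tokens)                        # (N, D)
    return (k @ q) / math.sqrt(q.numel())     # (N,) unnormalized relevance

def grid_nms_topk(scores, grid=7, k=8, radius=1):
    """Greedy top-K with spatial suppression on the token grid (grid NMS)."""
    s = scores.clone()
    keep = []
    for _ in range(k):
        idx = int(torch.argmax(s))
        if not torch.isfinite(s[idx]):        # everything left is suppressed
            break
        keep.append(idx)
        r, c = divmod(idx, grid)
        for rr in range(max(0, r - radius), min(grid, r + radius + 1)):
            lo, hi = max(0, c - radius), min(grid, c + radius + 1)
            s[rr * grid + lo : rr * grid + hi] = float("-inf")
    return torch.tensor(keep, dtype=torch.long)

class TinyTokenClassifier(nn.Module):
    """Small transformer that reads only [CLS] plus the K selected tokens."""
    def __init__(self, dim=192, n_classes=32, depth=2, heads=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, cls, selected):         # cls: (B, D), selected: (B, K, D)
        x = torch.cat([cls.unsqueeze(1), selected], dim=1)   # (B, K+1, D)
        return self.head(self.encoder(x)[:, 0])              # classify at [CLS]

# Toy usage: random features stand in for the Swin backbone's 7x7 grid.
B, N, D = 1, 49, 192
tokens = torch.randn(B, N, D)
cls = tokens.mean(dim=1)                      # stand-in pooled [CLS]
proj_q, proj_k = nn.Linear(D, D), nn.Linear(D, D)
idx = grid_nms_topk(cls_token_scores(tokens[0], cls[0], proj_q, proj_k))
logits = TinyTokenClassifier(dim=D)(cls, tokens[:, idx])     # (1, 32)
```

The selected indices `idx` are exactly the patches one would overlay as dots on the heatmap visualization, which is where the explainability claim comes from: the classifier provably sees only those K+1 tokens.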
Submission Number: 69