Multi-Attack Audio-Visual Spoof Detection for Secure Hearing-Assistive Systems Using Transformer Fusion

Aysha Munawwara, Kia Dashtipour, Nasir Saleem, Mandar Gogate, Adeel Hussain, Amir Hussain

Published: 28 Apr 2026, Last Modified: 28 May 2026, ICCK Transactions on Information Security and Cryptography, CC BY-SA 4.0

Abstract: Audio-visual spoofing attacks have emerged as a serious threat to modern hearing-assistive systems due to rapid advances in text-to-speech synthesis, neural vocoders, and lip-sync deepfake generation. Advanced hearing aids and cochlear implants increasingly incorporate AI-based speech enhancement and multimodal perception modules, which makes them vulnerable to manipulated or synthetic inputs. Traditional spoof detection approaches are often limited to binary classification between bonafide and spoofed speech, and so fail to capture the diversity of emerging multimodal attack types. In this paper, we propose a multi-attack audio-visual spoof detection framework that explicitly models four classes: real speech, text-to-speech (TTS) spoofing, vocoder-based spoofing, and lip-sync manipulation attacks. A multi-attack protocol is introduced to enable fine-grained supervision across both audio and video modalities. The proposed system employs a convolutional feature extractor for each stream, followed by multimodal fusion for robust classification. Experimental results demonstrate reliable performance under in-dataset evaluation settings. Confusion matrix analysis further highlights the effectiveness of audio-visual fusion, particularly in detecting visually driven spoofing attacks. Overall, this work provides a strong foundation for next-generation secure hearing-assistive systems operating in real-world acoustic environments.
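
As an illustration of the four-class protocol and the confusion matrix analysis mentioned in the abstract, the following is a minimal sketch rather than the authors' implementation. The class names, integer label ids, and helper functions (AttackClass, to_binary, confusion_matrix, per_class_recall) are assumptions introduced here for illustration; the abstract does not specify any of them.

```python
from enum import IntEnum


class AttackClass(IntEnum):
    """Hypothetical label ids for the four classes named in the abstract."""
    REAL = 0
    TTS = 1
    VOCODER = 2
    LIP_SYNC = 3


def to_binary(label):
    """Collapse a four-class label to the traditional bonafide/spoof view."""
    return "bonafide" if label == AttackClass.REAL else "spoof"


def confusion_matrix(y_true, y_pred, num_classes=len(AttackClass)):
    """Count predictions: rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[int(true_label)][int(pred_label)] += 1
    return matrix


def per_class_recall(matrix):
    """Recall per class; None when a class has no true samples."""
    recalls = {}
    for cls in AttackClass:
        total = sum(matrix[cls])
        recalls[cls.name] = matrix[cls][cls] / total if total else None
    return recalls


if __name__ == "__main__":
    # Toy labels made up for this example; they are not results from the paper.
    y_true = [AttackClass.REAL, AttackClass.TTS, AttackClass.VOCODER,
              AttackClass.LIP_SYNC, AttackClass.LIP_SYNC]
    y_pred = [AttackClass.REAL, AttackClass.TTS, AttackClass.TTS,
              AttackClass.LIP_SYNC, AttackClass.REAL]
    cm = confusion_matrix(y_true, y_pred)
    for cls, row in zip(AttackClass, cm):
        print(f"{cls.name:>8}: {row}")
    print(per_class_recall(cm))
```

Keeping the attack type in the label, instead of collapsing it to bonafide versus spoof, is what lets a per-class breakdown show which attack families are confused with one another, for example whether lip-sync manipulations are mistaken for real speech.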

External IDs: doi:10.62762/tisc.2026.221187