Abstract: Voice-enabled devices are becoming increasingly prevalent in the Internet of Things (IoT). Speech emotion recognition (SER), a key technology in modern voice-assisted applications, holds tremendous potential for delivering convenient and intelligent services. Unfortunately, SER service providers may not only analyze the emotions in users’ speech but also examine their speech content and voice characteristics, posing serious privacy risks. Existing real-time voice disguise methods, such as pitch scaling and vocal tract length normalization, offer practical support for voiceprint privacy protection but severely degrade SER accuracy. In this article, we propose a harmonic amplitude spectrum mapping (HASM) assisted voice disguise scheme that disguises the voice to preserve voiceprint privacy while safeguarding the emotional information it carries. Specifically, we first conduct an in-depth analysis of the speech features that reflect emotion and find that restoring the harmonic amplitude spectrum after altering the speaker’s voice is crucial for recovering the emotion in speech. Based on this finding, we preprocess the original speech signals with pitch scaling and design a mathematically grounded HASM-assisted disguise scheme to restore the emotional information. Our HASM-assisted voice disguise scheme is validated on the Berlin Emotional Speech Database and the LibriSpeech and VCTK data sets. At voiceprint privacy protection levels of 81.86%, 85.42%, and 91.15% on LibriSpeech and 96.83%, 98.25%, and 98.41% on VCTK, respectively, the accuracy of acoustic-feature-based SER on disguised speech decreases by only 4.19%, 6.21%, and 9.87%, and end-to-end SER accuracy decreases by only 3.69%, 7.37%, and 8.86%, outperforming other voice disguise methods.
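To make the two-stage idea in the abstract concrete, the sketch below illustrates pitch-scaling disguise followed by extraction of per-frame harmonic amplitudes. This is not the paper's HASM algorithm; it is a minimal toy pipeline assuming librosa/numpy as stand-in tooling, a hypothetical input file `speech.wav`, and illustrative parameters (`n_steps`, `n_harmonics`); the actual amplitude mapping and resynthesis are omitted.

```python
# Toy sketch of the two stages described in the abstract (not the paper's method).
import numpy as np
import librosa

def harmonic_amplitudes(mag, f0, sr, n_fft, n_harmonics=10):
    """Sample the magnitude spectrogram at integer multiples of f0, per frame."""
    n_bins = mag.shape[0]
    n_frames = min(mag.shape[1], len(f0))
    amps = np.zeros((n_harmonics, n_frames))
    bin_hz = sr / n_fft
    for t in range(n_frames):
        if np.isnan(f0[t]):
            continue  # unvoiced frame: no harmonic structure to sample
        for k in range(1, n_harmonics + 1):
            b = int(round(k * f0[t] / bin_hz))
            if b < n_bins:
                amps[k - 1, t] = mag[b, t]
    return amps

sr = 16000
y, _ = librosa.load("speech.wav", sr=sr)  # hypothetical input file
n_steps = 4                               # illustrative disguise strength (semitones)

# Stage 1: voiceprint disguise via pitch scaling.
y_disg = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

# Stage 2 (setup only): compare harmonic amplitude spectra of original vs. disguised.
n_fft, hop = 2048, 512
f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr,
                        frame_length=n_fft, hop_length=hop)
f0_disg = f0 * 2 ** (n_steps / 12)        # harmonics move with the pitch factor

S_orig = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
S_disg = np.abs(librosa.stft(y_disg, n_fft=n_fft, hop_length=hop))

A_orig = harmonic_amplitudes(S_orig, f0, sr, n_fft)
A_disg = harmonic_amplitudes(S_disg, f0_disg, sr, n_fft)
# A real HASM-style scheme would map A_disg toward A_orig and resynthesize the
# disguised waveform; that mapping and resynthesis are beyond this sketch.
```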