EMO-Codec: An In-Depth Look at Emotion Preservation Capacity of Legacy and Neural Codec Models with Subjective and Objective Evaluations

Published: 01 Jan 2024, Last Modified: 20 May 2025. Venue: APSIPA 2024. License: CC BY-SA 4.0
Abstract: Neural codecs reduce speech data transmission latency and serve as the underlying tokenizer for speech language models (speech LMs). Preserving emotional information in codes is crucial for effective communication and contextual understanding. However, there is a lack of studies on emotion loss in existing codecs. This paper evaluates neural and legacy codecs using subjective and objective methods on emotion datasets like IEMOCAP. Our study identifies which codecs best preserve emotional information at various bitrates. We found that training a codec with both English and Chinese data had limited success in retaining emotional information in Chinese. Additionally, resynthesizing speech through these codecs degrades the performance of speech emotion recognition (SER), especially for emotions such as sadness, depression, fear, and disgust. Human listening tests confirmed these findings. This work guides the development of future speech technology to ensure that new codecs maintain the integrity of emotional information in speech.