Multi-Feature Audio Fusion for Nonverbal Vocalization Classification

Published: 01 Jan 2025, Last Modified: 11 Aug 2025, ICASSP 2025, CC BY-SA 4.0
Abstract: Nonverbal vocalizations play an essential role in general communication and expression, and they are particularly integral for non- and minimally-speaking individuals with autism and other neurodevelopmental disorders. Accurate analysis and classification of these vocalizations are crucial for enhancing communication and deepening our understanding of this underserved population. However, most audio-processing machine-learning techniques rely on linguistic cues such as speech fluency and word pronunciation, and they do not generalize well to nonverbal vocalizations. We propose a multi-level fusion network that combines three types of audio features – Wav2Vec2 representations, mel-spectrograms, and low-level descriptors – using 6,551 samples across 7 vocalization classes from the open-access ReCANVo dataset. The model achieves 64.09% accuracy on the 7-class classification task, outperforming 8 traditional audio classification methods and 3 feature-fusion approaches. To address heterogeneity and noise in real-world audio data, we tested four sample augmentation techniques, obtaining a 14% relative increase in accuracy. We further examined a single individual (1,595 samples), obtaining a 6.29% relative increase in accuracy and highlighting the effects of speaker and environment variability in population-level models. We also achieved 85.80% binary classification accuracy for vocalizations associated with positive/negative affective states, suggesting a potentially robust and highly separable latent structure underlying the valence of these sounds. Finally, we quantified inter-class relationships between vocalizations using cosine similarity to offer additional insight into the acoustic patterns shared across classes. Our exploratory study not only uncovers key challenges in processing nonverbal vocalizations but also provides a foundational framework for future machine learning research with rare, real-world audio data, including multi-feature fusion and sample augmentation.
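
The abstract does not specify the fusion architecture in detail, so the following is a minimal illustrative sketch of multi-feature fusion in PyTorch, not the authors' implementation. All dimensions are assumptions: 768 for time-pooled Wav2Vec2 embeddings, 128 mel bands, 88 eGeMAPS-style low-level descriptors, and a 256-unit hidden layer; the per-feature projection branches followed by concatenation are one common way to realize this kind of fusion.

import torch
import torch.nn as nn

class MultiFeatureFusionNet(nn.Module):
    # Illustrative fusion classifier; dimensions and layer sizes are
    # assumptions, not the paper's reported architecture.
    def __init__(self, w2v_dim=768, mel_dim=128, lld_dim=88,
                 hidden_dim=256, num_classes=7):
        super().__init__()
        # One projection branch per feature type.
        self.w2v_branch = nn.Sequential(nn.Linear(w2v_dim, hidden_dim), nn.ReLU())
        self.mel_branch = nn.Sequential(nn.Linear(mel_dim, hidden_dim), nn.ReLU())
        self.lld_branch = nn.Sequential(nn.Linear(lld_dim, hidden_dim), nn.ReLU())
        # Classification head over the concatenated branch outputs.
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, w2v_feat, mel_feat, lld_feat):
        # Each input is a per-utterance feature vector (e.g., mean-pooled over time).
        fused = torch.cat([
            self.w2v_branch(w2v_feat),
            self.mel_branch(mel_feat),
            self.lld_branch(lld_feat),
        ], dim=-1)
        return self.classifier(fused)

# Forward pass on a batch of 4 random stand-in feature vectors.
model = MultiFeatureFusionNet()
logits = model(torch.randn(4, 768), torch.randn(4, 128), torch.randn(4, 88))
print(logits.shape)  # torch.Size([4, 7])

The inter-class analysis described above could similarly be approximated by averaging embeddings per class and comparing class pairs with torch.nn.functional.cosine_similarity, though the abstract does not state which representation the authors compared.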