N-CORE: N-View Consistency Regularization for Disentangled Representation Learning in Nonverbal Vocalizations
Abstract: Nonverbal vocalizations are an essential component of human communication, conveying rich information without linguistic content. However, the computational analysis of nonverbal vocalizations faces significant challenges due to the lack of lexical anchors, compounded by imbalanced multi-label distributions. While disentangled representation learning has shown promise in isolating specific speech features, its application to nonverbal speech remains unexplored. In this paper, we introduce N-CORE, a novel supervised framework designed to disentangle representations in nonverbal vocalizations by leveraging N views of the audio sample to learn invariance to specific perturbed features. We find that N-CORE achieves competitive performance compared to baseline methods on emotion and speaker classification tasks on the VIVAE, ReCANVo, and ReCANVo-Balanced datasets. We further propose an emotion perturbation function for audio signals that preserves speaker information, and validate speech transformation functions on nonverbal vocalizations. Our work informs research directions on applications of paralinguistic speech analysis, including privacy-preserving encoding, clinical diagnosis of atypical speech, and longitudinal analysis of communicative development.
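The N-view consistency idea in the abstract can be sketched minimally: perturb a sample into N views, encode each view, and penalize variation of the encodings across views so the representation becomes invariant to the perturbed feature. The encoder, perturbation function, and loss below are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(x, n_views, noise_scale=0.05):
    """Create N perturbed views of an audio feature vector.
    (Illustrative stand-in for the paper's perturbation functions,
    which are not specified here.)"""
    return [x + noise_scale * rng.standard_normal(x.shape) for _ in range(n_views)]

def encode(x, W):
    """Toy linear encoder standing in for the learned representation."""
    return W @ x

def n_view_consistency_loss(x, W, n_views=4, noise_scale=0.05):
    """Mean per-dimension variance of the encoding across the N views;
    minimizing it encourages invariance to the perturbed feature."""
    z = np.stack([encode(v, W) for v in perturb(x, n_views, noise_scale)])
    return float(z.var(axis=0).mean())

x = rng.standard_normal(16)       # stand-in audio feature vector
W = rng.standard_normal((8, 16))  # toy encoder weights
loss = n_view_consistency_loss(x, W)
print(loss >= 0.0)
```

In a full training setup this consistency term would be added to a supervised classification loss; here only the regularizer itself is shown.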
Paper Type: Long
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: automatic speech recognition, speech technologies, spoken language understanding
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: Paralinguistic Speech
Submission Number: 7553