Efficient Fine-Tuning Approaches on HuBERT for Speech Emotion Recognition on Multiple Labels

ACL ARR 2024 June Submission 971 Authors

13 Jun 2024 (modified: 02 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Models like HuBERT have shown significant promise in automatic speech recognition (ASR). In this work, we explore both vanilla fine-tuning and parameter-efficient fine-tuning of the HuBERT model for speech emotion recognition (SER). While most previous research on SER has focused on four basic emotions (happy, sad, angry, and neutral), we extend this by incorporating four additional emotions (surprise, fear, disgust, and calm), bringing the total to eight. Our experiments utilize four diverse datasets to enhance the robustness of our findings. Our methodology uses the Wav2Vec2FeatureExtractor from the HuBERT model to extract features from raw audio files. These features are fed into a sequence classification model built on the HuBERT architecture. We fine-tune the model using three approaches: vanilla fine-tuning, and two parameter-efficient variants that apply LoRA to the QKV projections and to the classifier, over a combination of several publicly available emotional speech datasets: RAVDESS, CREMA-D, TESS, and SAVEE. Vanilla fine-tuning outperforms the parameter-efficient approaches overall; however, the parameter-efficient approaches remain satisfactory and can be used when resources and computational power are limited.
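As a rough illustration of the pipeline described in the abstract, the sketch below wires a Wav2Vec2FeatureExtractor into a HuBERT sequence classifier with LoRA adapters on the Q/K/V projections, assuming the HuggingFace transformers and peft libraries. The checkpoint name, LoRA rank, and alpha are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch of the described SER pipeline, assuming `transformers` and `peft`.
# Checkpoint name, LoRA rank (r), and alpha are assumptions, not the paper's values.
import torch
from transformers import Wav2Vec2FeatureExtractor, HubertForSequenceClassification
from peft import LoraConfig, get_peft_model

NUM_LABELS = 8  # happy, sad, angry, neutral, surprise, fear, disgust, calm

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertForSequenceClassification.from_pretrained(
    "facebook/hubert-base-ls960", num_labels=NUM_LABELS
)

# Parameter-efficient variant: LoRA adapters on the attention Q/K/V projections,
# keeping the projector and classification head trainable.
lora_config = LoraConfig(
    r=8,           # assumed rank
    lora_alpha=16, # assumed scaling
    target_modules=["q_proj", "k_proj", "v_proj"],
    modules_to_save=["projector", "classifier"],
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()

# Raw waveform (16 kHz) -> features -> logits over the 8 emotion classes.
waveform = torch.randn(16000)  # 1 s of dummy audio standing in for a real file
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
logits = peft_model(**inputs).logits  # shape: (1, NUM_LABELS)
```

For the vanilla fine-tuning baseline, the LoRA step is simply skipped and all of the model's parameters are updated directly.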
Paper Type: Short
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: speech technologies, parameter-efficient training, fine-tuning
Contribution Types: Approaches for low-compute settings (efficiency), Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 971