Efficient Fine-Tuning Approaches on HuBERT for Speech Emotion Recognition on Multiple Labels

ACL ARR 2024 June Submission 971 Authors

13 Jun 2024 (modified: 02 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Models like HuBERT have shown significant promise in automatic speech recognition (ASR). In this work, we explore both vanilla fine-tuning and parameter-efficient fine-tuning of the HuBERT model for speech emotion recognition (SER). While most previous research on SER has focused on four basic emotions (happy, sad, angry, and neutral), we extend this by incorporating four additional emotions (surprise, fear, disgust, and calm), bringing the total to eight. Our experiments utilize four diverse datasets to enhance the robustness of our findings. Our methodology uses the Wav2Vec2FeatureExtractor from the HuBERT model to extract features from raw audio files. These features are fed into a sequence classification model built on the HuBERT architecture. We fine-tune the model using three approaches: vanilla fine-tuning, and two parameter-efficient variants that apply LoRA to the QKV projections and to the classifier, over a combination of several publicly available emotional speech datasets: RAVDESS, CREMA-D, TESS, and SAVEE. Vanilla fine-tuning outperforms the parameter-efficient approaches overall; however, the parameter-efficient approaches remain satisfactory and can be used when resources and computational power are limited.
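As a rough illustration of the pipeline described in the abstract, the sketch below wires a Wav2Vec2FeatureExtractor into a HuBERT sequence classifier with LoRA adapters on the Q/K/V projections, assuming the HuggingFace transformers and peft libraries. The checkpoint name, LoRA rank, and alpha are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch of the described SER pipeline, assuming `transformers` and `peft`.
# Checkpoint name, LoRA rank (r), and alpha are assumptions, not the paper's values.
import torch
from transformers import Wav2Vec2FeatureExtractor, HubertForSequenceClassification
from peft import LoraConfig, get_peft_model

NUM_LABELS = 8  # happy, sad, angry, neutral, surprise, fear, disgust, calm

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = HubertForSequenceClassification.from_pretrained(
    "facebook/hubert-base-ls960", num_labels=NUM_LABELS
)

# Parameter-efficient variant: LoRA adapters on the attention Q/K/V projections,
# keeping the projector and classification head trainable.
lora_config = LoraConfig(
    r=8,           # assumed rank
    lora_alpha=16, # assumed scaling
    target_modules=["q_proj", "k_proj", "v_proj"],
    modules_to_save=["projector", "classifier"],
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()

# Raw waveform (16 kHz) -> features -> logits over the 8 emotion classes.
waveform = torch.randn(16000)  # 1 s of dummy audio standing in for a real file
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
logits = peft_model(**inputs).logits  # shape: (1, NUM_LABELS)
```

For the vanilla fine-tuning baseline, the LoRA step is simply skipped and all of the model's parameters are updated directly.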
Paper Type: Short
Research Area: Speech Recognition, Text-to-Speech and Spoken Language Understanding
Research Area Keywords: speech technologies, parameter-efficient training, fine-tuning
Contribution Types: Approaches for low-compute settings (efficiency), Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 971