Multimodal Audio-Language Model for Speech Emotion Recognition

Published: 01 Jan 2024 (Odyssey 2024) · Last Modified: 03 Apr 2025 · License: CC BY-SA 4.0
Abstract: In this paper, we present an approach to speech emotion recognition (SER) that leverages recent advances in machine learning and audio processing, in particular the integration of Large Language Models (LLMs) with audio capabilities. Our proposed architecture combines an audio encoder, the Whisper-large-v3 model [1], with the LLMs Phi 1.5 [2] and Gemma 2b [3] to build a robust and effective system for categorizing emotions in speech. We compare our models against existing approaches and achieve strong results on speech emotion recognition. Our findings demonstrate the effectiveness of audio-language models (ALMs), with the Whisper-large-v3 and Gemma 2b combination outperforming the alternatives.
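
The architecture described above (a speech encoder whose outputs are fed into an LLM, with emotion categories predicted on top) can be illustrated with a minimal PyTorch / Hugging Face Transformers sketch. The frozen encoder, linear projection connector, mean pooling, and classifier head below are assumptions for illustration only, not the paper's documented design; the checkpoint names follow the cited models.

```python
import torch
import torch.nn as nn
from transformers import WhisperModel, AutoModelForCausalLM

class AudioLanguageSER(nn.Module):
    """Sketch of an audio-language model for SER: Whisper encoder -> LLM -> classifier."""

    def __init__(self, num_emotions: int = 4,
                 audio_model: str = "openai/whisper-large-v3",
                 llm_model: str = "google/gemma-2b"):
        super().__init__()
        # Whisper encoder maps log-mel spectrogram features to audio embeddings.
        # Freezing it is an assumption; the paper may fine-tune the encoder.
        self.encoder = WhisperModel.from_pretrained(audio_model).encoder
        self.encoder.requires_grad_(False)
        self.llm = AutoModelForCausalLM.from_pretrained(llm_model)
        # Hypothetical connector: a linear layer from the encoder width
        # to the LLM's hidden size.
        self.proj = nn.Linear(self.encoder.config.d_model,
                              self.llm.config.hidden_size)
        self.classifier = nn.Linear(self.llm.config.hidden_size, num_emotions)

    def forward(self, input_features: torch.Tensor) -> torch.Tensor:
        # input_features: (batch, n_mels, frames) from WhisperFeatureExtractor.
        audio_emb = self.encoder(input_features).last_hidden_state
        # Project audio tokens into the LLM embedding space and run the LLM
        # directly on them via inputs_embeds.
        hidden = self.llm(inputs_embeds=self.proj(audio_emb),
                          output_hidden_states=True).hidden_states[-1]
        # Mean-pool over time, then classify into emotion categories.
        return self.classifier(hidden.mean(dim=1))
```

A linear connector is only the simplest option; multimodal systems also use MLPs or cross-attention modules, and the abstract does not specify which variant, pooling strategy, or fine-tuning regime the authors adopted.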