Multimodal Emotion Captioning Using Large Language Model with Prompt Engineering

Published: 01 Jan 2024, Last Modified: 15 May 2025. MRAC@MM 2024. License: CC BY-SA 4.0
Abstract: This paper addresses the challenges in MER 2024 by focusing on the Open Vocabulary (OV) task, which extends beyond the traditional fixed label space for multimodal emotion recognition. The study emphasizes the use of Large Language Models (LLMs) to interpret and extract emotional information from multimodal inputs, complemented by speech transcription, speech emotion description, and video clues. These features are integrated into a prompt fed to a pre-trained LLaMA3-8B model, using prompt engineering to achieve satisfactory results without fine-tuning. This approach bridges the gap between speech, video, and text data, leveraging the full potential of LLMs for open-ended emotion recognition tasks and offering a practical solution to the field.
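The abstract describes assembling speech transcription, a speech emotion description, and video clues into a single prompt for the LLM. A minimal sketch of such prompt construction is shown below; the function name, field labels, and prompt wording are all illustrative assumptions, not the authors' actual template.

```python
# Hedged sketch: one plausible way to combine the multimodal cues
# described in the abstract into a single prompt for a pre-trained LLM.
# The template wording and field names are assumptions for illustration.

def build_emotion_prompt(transcription: str,
                         speech_emotion: str,
                         video_clues: str) -> str:
    """Fuse speech transcription, a speech emotion description, and
    visual clues into one instruction prompt for open-vocabulary
    emotion captioning (prompt engineering only, no fine-tuning)."""
    return (
        "You are an emotion analysis assistant.\n"
        f"Speech transcription: {transcription}\n"
        f"Speech emotion description: {speech_emotion}\n"
        f"Video clues: {video_clues}\n"
        "Describe the speaker's emotional state in open vocabulary."
    )

# Example with hypothetical inputs.
prompt = build_emotion_prompt(
    transcription="I can't believe we actually won!",
    speech_emotion="high pitch, fast tempo, rising intonation",
    video_clues="wide smile, raised arms",
)
print(prompt)
```

The resulting string would then be passed to the pre-trained LLaMA3-8B model; since no fine-tuning is involved, the prompt template itself carries the task specification.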