Intelligent Multimodal Artificial Agents that Talk and Express Emotions

Published: 01 Jan 2024, Last Modified: 19 May 2025, HFR 2024, CC BY-SA 4.0
Abstract: With the advent of Large Language Models (LLMs) such as ChatGPT and Llama, humans can now hold meaningful dialogues with artificial agents. However, these LLMs do not take human emotions into account, and their responses often sound neutral and carry no emotion. It is important for social robots to understand human emotions and respond to them appropriately. In this study, we introduce a multimodal agent that responds to humans in an empathetic manner by taking their facial expressions into consideration. We finetune Llama2 on inputs formed by concatenating the embeddings of the facial features with the textual embeddings. To account for the multimodality, the generated response always contains the sentence “You look <Facial Expression>.”, followed by the response generated by Llama2. We provide qualitative samples. For quantitative analysis, we conduct a survey asking participants whether they find the responses generated by our finetuned model empathetic. The survey results show that the responses generated by our finetuned model are empathetic, thereby demonstrating the effectiveness of our method.
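
The sketch below illustrates, in a minimal and non-authoritative way, the kind of input construction the abstract describes: a facial-feature vector projected into the language model's embedding space and concatenated with the text token embeddings before being fed to a Llama-style decoder. All names and dimensions (FACE_DIM, HIDDEN_DIM, face_projector, build_inputs) are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (assumed, not the authors' implementation): prepend a projected
# facial-feature embedding to the token embeddings of the text prompt.
import torch
import torch.nn as nn

FACE_DIM = 512      # assumed size of the facial-feature vector
HIDDEN_DIM = 4096   # hidden size of the language model (4096 for Llama2-7B)

# Project facial features into the language model's embedding space.
face_projector = nn.Linear(FACE_DIM, HIDDEN_DIM)

def build_inputs(face_features: torch.Tensor, token_embeds: torch.Tensor) -> torch.Tensor:
    """Concatenate one facial-feature embedding with the text token embeddings.

    face_features: (batch, FACE_DIM) vector from a face encoder.
    token_embeds:  (batch, seq_len, HIDDEN_DIM) embeddings of the text prompt.
    Returns:       (batch, seq_len + 1, HIDDEN_DIM) sequence for the decoder.
    """
    face_embed = face_projector(face_features).unsqueeze(1)  # (batch, 1, HIDDEN_DIM)
    return torch.cat([face_embed, token_embeds], dim=1)

# Example with random tensors standing in for a real face encoder and tokenizer.
face_features = torch.randn(1, FACE_DIM)
token_embeds = torch.randn(1, 16, HIDDEN_DIM)
inputs_embeds = build_inputs(face_features, token_embeds)
print(inputs_embeds.shape)  # torch.Size([1, 17, 4096])
# The resulting sequence could then be passed to a Llama-style model,
# e.g. via an `inputs_embeds` argument, during finetuning.
```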