An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability
TL;DR: We conduct an in-depth investigation into three pivotal factors that influence the configuration of In-Context Learning demonstrations for Multimodal Sentiment Analysis.
Abstract: The advancements in Multimodal Large Language Models (MLLMs) have enabled various multimodal tasks to be addressed under a zero-shot paradigm. This paradigm sidesteps the cost of model fine-tuning and has emerged as a dominant trend in practical applications. Nevertheless, Multimodal Sentiment Analysis (MSA), a pivotal challenge in the quest for general artificial intelligence, does not enjoy this convenience. The zero-shot paradigm exhibits undesirable performance on MSA, casting doubt on whether MLLMs can perceive sentiments as competently as supervised models. By extending the zero-shot paradigm to In-Context Learning (ICL) and conducting an in-depth study on configuring demonstrations, we validate that MLLMs indeed possess this capability. Specifically, three key factors that cover demonstrations' retrieval, presentation, and distribution are comprehensively investigated and optimized. A sentiment-prediction bias inherent in MLLMs is also discovered and effectively counteracted. Complementing one another, the strategies devised for the three factors yield average accuracy improvements of 15.9% on six MSA datasets over the zero-shot paradigm and 11.2% over a random ICL baseline.
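To make the abstract's terminology concrete, the following minimal Python sketch illustrates two of the ideas it mentions: similarity-based retrieval of ICL demonstrations and a simple subtraction-style calibration of a label bias. The function names (retrieve_demonstrations, calibrate_prediction), the cosine-similarity retrieval, and the content-free-prior calibration are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def retrieve_demonstrations(query_emb, pool_embs, pool_labels, k=4):
    # Rank candidate demonstrations by cosine similarity to the query and keep the top k.
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    scores = p @ q
    top = np.argsort(-scores)[:k]
    return [(int(i), pool_labels[int(i)]) for i in top]

def calibrate_prediction(label_logprobs, prior_logprobs):
    # Subtract a prior measured on a content-free input from each label's
    # log-probability, so a label the model favors regardless of input no
    # longer dominates the argmax.
    scores = {lab: lp - prior_logprobs[lab] for lab, lp in label_logprobs.items()}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pool_embs = rng.normal(size=(100, 32))   # embeddings of candidate demonstrations (hypothetical)
    pool_labels = rng.choice(["negative", "neutral", "positive"], size=100)
    query_emb = rng.normal(size=32)          # embedding of the test sample (hypothetical)

    print("retrieved demonstrations:", retrieve_demonstrations(query_emb, pool_embs, pool_labels, k=4))

    # Hypothetical per-label log-probabilities from an MLLM and its content-free prior.
    label_logprobs = {"negative": -1.2, "neutral": -0.4, "positive": -1.6}
    prior_logprobs = {"negative": -1.5, "neutral": -0.3, "positive": -1.4}
    print("calibrated label:", calibrate_prediction(label_logprobs, prior_logprobs))
```

In this sketch the retrieval step addresses the "retrieval" factor, while the subtraction of a content-free prior is one plausible way to counteract a predictive bias; how demonstrations are presented and distributed across sentiment classes, the other two factors studied in the paper, are not modeled here.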
Lay Summary: AI models struggle to understand human emotions in images or videos without being specifically trained for this task. This raises doubts about whether advanced AI systems can truly "sense" feelings like humans do.
We discovered that by carefully selecting and organizing examples shown to the AI — like teaching a student with well-designed practice questions — its ability to analyze emotions improves dramatically. We also identified and fixed a hidden bias in how these models predict emotions.
Our approach boosted accuracy by an average of 15.9% across six emotion analysis datasets, showing that AI can interpret emotions effectively without costly retraining. This paves the way for more intuitive AI tools in mental health support, customer feedback analysis, and other real-world applications where understanding emotions matters.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications->Computer Vision
Keywords: Multimodal Sentiment Analysis, Multimodal Large Language Model, In-Context Learning
Submission Number: 16131