Context-Aware Emotion Understanding with Vision-Language Models

Yue Yao

Published: 28 Feb 2026, Last Modified: 26 Mar 2026, Venue: OpenReview Archive Direct Upload, Readers: Everyone, License: CC BY-NC 4.0

Abstract: Recent advances in large vision-language models (LVLMs) have significantly improved holistic scene interpretation, yet their performance on context-based emotion recognition remains limited. These models often favor global contextual cues while underutilizing subtle facial micro-expression signals, leading to coarse predictions or spurious reasoning. In this work, we introduce a dynamic facial prior integration framework that augments LVLMs with adaptive Action Unit (AU) cues. Rather than relying on fixed or uniform facial feature injection, we characterize how strongly each emotion category depends on facial micro-expressions in context-based emotion recognition. This profiling enables the generation and weighting of AU descriptors conditioned on the inferred emotional hypothesis. Our approach follows a two-stage reasoning process: the LVLM first generates a context-driven prediction, and a refinement stage then incorporates adaptively injected AU information to resolve ambiguities. Experimental results demonstrate consistent gains over strong zero-shot and supervised baselines, highlighting the effectiveness of combining coarse scene understanding with fine-grained facial priors for robust emotion recognition.
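
The two-stage process described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `lvlm` callable, the `FACE_DEPENDENCE` and `RELEVANT_AUS` tables, the prompt wording, and all numeric values are hypothetical placeholders chosen only to show the control flow (context-driven hypothesis first, then hypothesis-conditioned AU weighting and refinement).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical per-emotion reliance on facial cues (0 = context only, 1 = face only).
# Placeholder numbers for illustration; they are not taken from the paper.
FACE_DEPENDENCE: Dict[str, float] = {"happiness": 0.8, "anticipation": 0.2}

# Hypothetical mapping from an emotion hypothesis to the AUs treated as relevant.
RELEVANT_AUS: Dict[str, List[str]] = {
    "happiness": ["AU6", "AU12"],
    "anticipation": ["AU1", "AU2"],
}


def weighted_au_descriptors(
    hypothesis: str, au_intensity: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Pick the AUs relevant to the hypothesis and scale each by face dependence."""
    weight = FACE_DEPENDENCE.get(hypothesis, 0.5)  # assumed neutral default
    return [
        (au, weight * au_intensity.get(au, 0.0))
        for au in RELEVANT_AUS.get(hypothesis, [])
    ]


def two_stage_predict(
    image: object,
    lvlm: Callable[[object, str], str],
    au_intensity: Dict[str, float],
) -> str:
    """Stage 1: context-driven hypothesis. Stage 2: refinement with weighted AU cues."""
    hypothesis = lvlm(image, "Infer the person's emotion from the scene context.")
    cues = weighted_au_descriptors(hypothesis, au_intensity)
    cue_text = ", ".join(f"{au} (weight {w:.2f})" for au, w in cues)
    refine_prompt = (
        f"Initial hypothesis: {hypothesis}. "
        f"Facial cues: {cue_text}. Refine the prediction."
    )
    return lvlm(image, refine_prompt)
```

In this sketch the same model is queried twice, and only the second prompt carries facial information, so the AU cues act as a correction to the context-driven guess rather than a uniform input.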