Adaptive Facial Detail in Large Vision-Language Models for Emotion Recognition
Abstract: While large vision-language models (LVLMs) have shown strong generalization across multimodal tasks, they remain challenged by the nuanced requirements of emotion recognition in complex scenes. A common limitation is their bias toward contextual information: they often neglect subtle facial micro-expression signals or introduce hallucinated cues. To address this issue, we propose a context-based emotion recognition (CBER) framework that incorporates facial Action Units (AUs) as structured expert knowledge within LVLM inference. Unlike static prompting approaches, our method models emotion-specific sensitivity to facial features by estimating dependency patterns across emotion categories, enabling adaptive AU injection at runtime. The inference pipeline adopts a coarse-to-fine strategy: an initial prediction derived from global context guides the subsequent injection of selectively filtered AU descriptions, which refine the model’s interpretation. Extensive evaluation shows that the proposed method improves zero-shot performance and achieves competitive results against fully supervised approaches, suggesting that adaptive fusion of scene-level and facial-level signals is critical for reliable CBER.
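The coarse-to-fine step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the sensitivity table `AU_SENSITIVITY`, its numeric values, the `threshold` parameter, and the helper names `select_aus` and `build_refinement_prompt` are all hypothetical and chosen only for illustration. The sketch assumes a coarse emotion guess and a list of detected AUs are already available, keeps only the AUs judged relevant to that guess, and turns them into a refinement prompt.

```python
from typing import Dict, List

# Hypothetical emotion-to-AU sensitivity scores (illustrative values only,
# not taken from the paper).
AU_SENSITIVITY: Dict[str, Dict[str, float]] = {
    "happiness": {"AU6": 0.9, "AU12": 0.95, "AU4": 0.1},
    "anger": {"AU4": 0.9, "AU7": 0.7, "AU12": 0.05},
}

# Standard FACS names for the action units used above.
AU_DESCRIPTIONS: Dict[str, str] = {
    "AU4": "brow lowerer",
    "AU6": "cheek raiser",
    "AU7": "lid tightener",
    "AU12": "lip corner puller",
}


def select_aus(coarse_emotion: str, detected_aus: List[str],
               threshold: float = 0.5) -> List[str]:
    """Keep detected AUs whose assumed sensitivity for the coarse guess
    meets the threshold; unknown emotions select nothing."""
    weights = AU_SENSITIVITY.get(coarse_emotion, {})
    return [au for au in detected_aus if weights.get(au, 0.0) >= threshold]


def build_refinement_prompt(coarse_emotion: str, detected_aus: List[str],
                            threshold: float = 0.5) -> str:
    """Compose the second-stage prompt from the coarse guess and filtered AUs."""
    selected = select_aus(coarse_emotion, detected_aus, threshold)
    if not selected:
        return (f"Initial guess from scene context: {coarse_emotion}. "
                "No facial cues were added. Confirm or revise the emotion.")
    cues = ", ".join(
        f"{au} ({AU_DESCRIPTIONS.get(au, 'unknown action unit')})"
        for au in selected
    )
    return (f"Initial guess from scene context: {coarse_emotion}. "
            f"Observed facial action units: {cues}. "
            "Confirm or revise the emotion.")


if __name__ == "__main__":
    print(build_refinement_prompt("happiness", ["AU4", "AU6", "AU12"]))
```

In this sketch the filtering depends on the first-stage guess, so the same detected AUs yield different prompts for different coarse predictions, which is the adaptive behavior the abstract contrasts with static prompting.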