Context-Aware Emotion Recognition via Multi-View Instruction-Tuned Visual Language Guidance

20 Sept 2025 (modified: 29 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Context-aware Emotion Recognition, Vision-Language Models (VLMs), Representation Learning, Disentangled Representation, Interpretability, Parameter-Efficient Fine-Tuning (PEFT)
TL;DR: We propose a parameter-efficient method that adapts a Vision-Language Model to recognize emotions by interpretably disentangling scene, body, and face cues from a single image.
Abstract: Context-aware emotion recognition often relies on heterogeneous cues, but many state-of-the-art systems still hinge on engineered signals (e.g., pose landmarks or temporal cues), limiting applicability. Meanwhile, VLM-based emotion recognition remains relatively under-explored in current research. Our work targets this gap with a parameter-efficient, interpretable design. To mitigate class imbalance and make view–emotion relations explicit, we first curate an LLM-assisted QA dataset. In Stage 1, the VLM is adapted into a multi-view emotion encoder that extracts fine-grained features from scene, body, and face using shared, parameter-efficient components with view-specific pathways, enabling interpretable evidence disentanglement from a single image. In Stage 2, the VLM remains frozen and its scene/body/face descriptors are fused by a lightweight head. This preserves VLM knowledge (avoiding overfitting and label coupling) while yielding independent, well-calibrated scores that support flexible thresholds, plug-and-play label sets, and strong sample efficiency. Using only single-image inputs, our pipeline attains 37.88 mAP on EMOTIC, 88.82% top-1 accuracy on CAER-S, and higher recall/F1 on HECO than prior VLM-based baselines, while offering clear per-view interpretability. Code, prompts, and data splits will be released.
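To make the two-stage shape described in the abstract concrete, here is a minimal PyTorch sketch, not the authors' implementation. It assumes a LoRA-style shared low-rank adapter, a placeholder stand-in for the frozen VLM image encoder, and hypothetical names (FrozenVLMEncoder, MultiViewAdapter, FusionHead); the actual backbone, feature dimensions, and adapter design are not given in the abstract. The 26-way output follows EMOTIC's 26 discrete emotion categories; sigmoid scores keep per-label outputs independent, which is what enables the flexible thresholds and plug-and-play label sets the abstract claims.

```python
import torch
import torch.nn as nn

class FrozenVLMEncoder(nn.Module):
    """Placeholder for a frozen VLM image encoder (Stage 2 keeps it frozen)."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.backbone = nn.Linear(3 * 224 * 224, dim)  # stand-in for a real VLM
        for p in self.parameters():
            p.requires_grad = False  # VLM knowledge is preserved, never fine-tuned here

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.backbone(image.flatten(1))

class MultiViewAdapter(nn.Module):
    """Stage 1 idea: shared parameter-efficient component + view-specific pathways."""
    VIEWS = ("scene", "body", "face")

    def __init__(self, dim: int = 768, rank: int = 8):
        super().__init__()
        # Shared LoRA-style low-rank residual adapter.
        self.shared_down = nn.Linear(dim, rank, bias=False)
        self.shared_up = nn.Linear(rank, dim, bias=False)
        # One lightweight pathway per view, yielding disentangled descriptors.
        self.view_heads = nn.ModuleDict({v: nn.Linear(dim, dim) for v in self.VIEWS})

    def forward(self, feat: torch.Tensor) -> dict[str, torch.Tensor]:
        shared = feat + self.shared_up(self.shared_down(feat))
        return {v: head(shared) for v, head in self.view_heads.items()}

class FusionHead(nn.Module):
    """Stage 2 idea: a lightweight head fuses the three frozen-VLM descriptors."""
    def __init__(self, dim: int = 768, num_labels: int = 26):  # 26 = EMOTIC categories
        super().__init__()
        self.fuse = nn.Linear(3 * dim, num_labels)

    def forward(self, views: dict[str, torch.Tensor]) -> torch.Tensor:
        stacked = torch.cat([views["scene"], views["body"], views["face"]], dim=-1)
        # Sigmoid (not softmax): independent per-label scores, thresholdable per label.
        return torch.sigmoid(self.fuse(stacked))

# Usage: a single image in, one score per emotion label out.
encoder, adapter, head = FrozenVLMEncoder(), MultiViewAdapter(), FusionHead()
image = torch.randn(2, 3, 224, 224)
scores = head(adapter(encoder(image)))  # shape (2, 26)
```

Because only the adapter and fusion head carry trainable parameters while the encoder stays frozen, this sketch also illustrates where the claimed parameter efficiency and sample efficiency would come from.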
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 24394