Mitigating context bias in vision–language models via multimodal emotion recognition

Published: 19 Aug 2025 · Last Modified: 14 Feb 2026 · Electronics · CC BY 4.0
Abstract: Vision–Language Models (VLMs) have become key contributors to the state of the art in contextual emotion recognition, demonstrating a superior ability to understand the relationship between context, facial expressions, and interactions in images compared to traditional approaches. However, their reliance on contextual cues can introduce unintended biases, especially when the background does not align with the individual's true emotional state. This raises concerns about the reliability of such models in real-world applications, where robustness and fairness are critical. In this work, we explore the limitations of current VLMs in emotionally ambiguous scenarios and propose a method to overcome contextual bias. Existing VLM-based captioning solutions tend to overweight background and contextual information when determining emotion, often at the expense of the individual's actual expression. To study this phenomenon, we created synthetic datasets by automatically extracting people from the original images using YOLOv8 and placing them on randomly selected backgrounds from the Landscape Pictures dataset. This allowed us to reduce the correlation between emotional expression and background context while preserving body pose. Through discriminative analysis of VLM behavior on images with both correct and mismatched backgrounds, we find that the predicted emotion varies with the background in 93% of cases, even when models are explicitly instructed to focus on the person. To address this, we propose a multimodal approach, named BECKI, that incorporates body pose, full image context, and a novel description stream focused exclusively on identifying the emotional discrepancy between the individual and the background. Our primary contribution lies not only in identifying the weaknesses of existing VLMs but also in proposing a more robust, context-resilient solution. Our method achieves up to 96% accuracy, highlighting its effectiveness in mitigating contextual bias.
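To make the dataset-construction step concrete, the following is a minimal sketch of extracting a person with YOLOv8 segmentation and compositing them onto a randomly chosen landscape background. It assumes the `ultralytics` package and OpenCV; the `yolov8n-seg.pt` checkpoint, file layout, mask threshold, and single-person handling are illustrative assumptions, not the authors' exact pipeline.

```python
# Sketch of the synthetic-dataset construction described in the abstract:
# detect a person with YOLOv8 segmentation, then paste them onto a random
# landscape background. Paths and parameters below are placeholders.
import random
from pathlib import Path

import cv2
import numpy as np
from ultralytics import YOLO  # pip install ultralytics

model = YOLO("yolov8n-seg.pt")  # segmentation variant yields per-person masks


def composite_person(image_path: Path, background_dir: Path) -> np.ndarray | None:
    """Cut out the first detected person and place them on a random background."""
    image = cv2.imread(str(image_path))
    result = model(image, verbose=False)[0]
    if result.masks is None:
        return None  # no segmentation output at all

    # Keep only detections of class 0 ("person" in the COCO label set).
    person_ids = [i for i, c in enumerate(result.boxes.cls) if int(c) == 0]
    if not person_ids:
        return None

    # Masks come back at model resolution; resizing to the source image is an
    # approximation if the detector letterboxes its input, but fine for a sketch.
    mask = result.masks.data[person_ids[0]].cpu().numpy()
    mask = cv2.resize(mask, (image.shape[1], image.shape[0]))

    background_path = random.choice(list(background_dir.glob("*.jpg")))
    background = cv2.imread(str(background_path))
    background = cv2.resize(background, (image.shape[1], image.shape[0]))

    # Person pixels from the original image, everything else from the landscape.
    mask3 = np.repeat(mask[:, :, None] > 0.5, 3, axis=2)
    return np.where(mask3, image, background)


# Example usage (paths are placeholders):
# out = composite_person(Path("source/person_001.jpg"), Path("landscape_pictures/"))
# if out is not None:
#     cv2.imwrite("synthetic/person_001_mismatched.jpg", out)
```

Using segmentation masks rather than bounding boxes keeps the person's silhouette intact, which is what lets the procedure decorrelate background context from emotional expression while, as the abstract notes, preserving body pose.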