Keywords: Federated learning, visual language models, small vision language models, quantized low-rank adaptation, context, privacy-preserving learning, emotion recognition
TL;DR: CAREFL enables efficient, privacy-preserving emotion recognition by federated fine-tuning of lightweight VLMs enriched with contextual prompts from larger models.
Abstract: Emotion recognition from images is a challenging task because it depends on subtle visual cues and contextual information. Recent advances in Vision-Language Models (VLMs) have demonstrated strong performance in this domain, but these models are often limited by their large computational footprint and the privacy concerns associated with centralized training. To address these challenges, we propose CAREFL (Context-Aware Recognition of Emotions with Federated Learning), a framework for efficient emotion recognition. CAREFL combines a large VLM, specifically LLaVA 1.5, for generating rich contextual descriptions with a lightweight small VLM, SmolVLM2, fine-tuned under a federated learning setup using Quantized Low-Rank Adaptation (QLoRA). This design enables accurate, privacy-preserving, and resource-efficient training on edge devices. Although this work evaluates CAREFL in the context of emotion recognition, the framework is general by design: leveraging VLMs, it can also be fine-tuned for a wide range of multimodal description and classification tasks beyond emotion analysis.
Extensive experiments demonstrate that CAREFL outperforms state-of-the-art baselines, achieving up to 96.49% mAP and 50.36% F1-score, surpassing heavier models such as GPT-4o, LLaVA, and EMOTIC. An ablation study further confirms the contribution of contextual enrichment, prompt design, and quantization in enhancing performance. The results show that federated fine-tuning of lightweight VLMs, when guided by contextual reasoning from large-scale models, provides a practical and scalable solution for emotion recognition in privacy-sensitive and resource-constrained environments.
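The privacy benefit described above comes from clients exchanging only their small low-rank adapter updates rather than raw images or full model weights. A minimal sketch of the FedAvg-style aggregation of such adapters is shown below; the function name, client data, matrix shapes, and equal-size weighting are illustrative assumptions, not code from the paper:

```python
# Conceptual sketch (not the authors' implementation): server-side
# weighted averaging of LoRA-style low-rank adapter matrices collected
# from federated clients. Names and shapes are hypothetical.

def fedavg_adapters(client_adapters, client_sizes):
    """Return the dataset-size-weighted average of per-client adapters.

    client_adapters: list of dicts mapping adapter name -> matrix
                     (nested lists), one dict per client.
    client_sizes:    number of local training samples per client,
                     used as FedAvg weights.
    """
    total = sum(client_sizes)
    averaged = {}
    for key in client_adapters[0]:
        rows = len(client_adapters[0][key])
        cols = len(client_adapters[0][key][0])
        avg = [[0.0] * cols for _ in range(rows)]
        for adapters, n in zip(client_adapters, client_sizes):
            weight = n / total  # proportional to local dataset size
            for i in range(rows):
                for j in range(cols):
                    avg[i][j] += weight * adapters[key][i][j]
        averaged[key] = avg
    return averaged

# Two hypothetical clients, each holding rank-1 adapter factors A and B.
clients = [
    {"lora_A": [[1.0, 2.0]], "lora_B": [[0.0], [4.0]]},
    {"lora_A": [[3.0, 0.0]], "lora_B": [[2.0], [0.0]]},
]
global_update = fedavg_adapters(clients, client_sizes=[1, 1])
# With equal client sizes this is a plain elementwise mean:
# global_update["lora_A"] == [[2.0, 1.0]]
```

Because each adapter pair (A, B) is tiny relative to the frozen quantized backbone, a round of communication costs a small fraction of shipping full model weights, which is what makes this setup viable on edge devices.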
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 11504