From Emotion Recognition to Mind-Wandering Detection: A Comparative Analysis of Video-Based Emotion Foundation Models

Published: 13 May 2026, Last Modified: 13 May 2026. Venue: CV4Edu - Computer Vision for Education (CVPR 2026). License: CC BY 4.0
Keywords: mind-wandering detection, educational video analysis, cognitive state inference, affective computing, facial expression recognition, frozen encoders, foundation models, multimodal learning, emotion recognition, attention-aware learning systems
TL;DR: We test whether recent emotion-recognition foundation-model features transfer better to video-based mind-wandering detection than previous approaches; Emotion-LLaMA-based representations produce more ambiguous, less aligned predictions.
Abstract: Automated mind-wandering (MW) detection from educational video offers a potential path toward continuous, non-intrusive measurement of attentional state during learning. Recent work introduced a pragmatic starting point for video-based MW detection by transferring facial emotion recognition (ER) features to an in-lab reading dataset with MW labels, showing that an AffectNet-pretrained ResNet50 encoder can support above-chance prediction. In this work, we revisit this approach in light of recent ER foundation models by evaluating four frozen feature extractors (the AffectNet-pretrained ResNet50 baseline, MAE, VideoMAE, and the full Emotion-LLaMA representations) within the same downstream MW classification task. Across experiments, the AffectNet-pretrained baseline remains the strongest overall encoder, while none of the newer Emotion-LLaMA-based representations improves MW prediction despite greater architectural sophistication. To understand this gap, we analyze per-encoder error profiles, prediction-score separability, shared versus encoder-specific failures, hard versus easy subsets, and Emotion-LLaMA’s predicted emotion labels. Results indicate that Emotion-LLaMA, a state-of-the-art foundation model across several ER benchmarks, produces more ambiguous MW decision scores, over-predicts MW more frequently, and differs only weakly across MW-relevant error cases, suggesting that stronger emotion recognition models do not necessarily provide useful features for mind-wandering detection. Our findings highlight the limitations of “emotion to mind wandering” transfer and the need to develop encoders that capture learning-specific signals.
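To illustrate the shared versus encoder-specific failure analysis and the MW over-prediction comparison mentioned in the abstract, here is a minimal Python sketch. It assumes each encoder's downstream classifier yields binary predictions (1 = MW, 0 = not MW) on the same test samples. The function names, encoder names, and toy data are hypothetical and are not taken from the paper; the paper's actual analysis may differ.

```python
from typing import Dict, List, Set, Tuple


def error_indices(y_true: List[int], y_pred: List[int]) -> Set[int]:
    """Indices of samples where the prediction disagrees with the label."""
    return {i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p}


def failure_overlap(
    y_true: List[int], preds: Dict[str, List[int]]
) -> Tuple[Set[int], Dict[str, Set[int]]]:
    """Split errors into shared failures (every encoder is wrong) and
    encoder-specific failures (only that one encoder is wrong)."""
    errors = {name: error_indices(y_true, p) for name, p in preds.items()}
    shared = set.intersection(*errors.values())
    specific = {
        name: errs - set().union(*(e for other, e in errors.items() if other != name))
        for name, errs in errors.items()
    }
    return shared, specific


def mw_overprediction_rate(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of non-MW samples (label 0) that were predicted as MW (label 1)."""
    negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    return sum(negatives) / len(negatives) if negatives else 0.0


# Toy, made-up labels and predictions; NOT data from the paper.
y_true = [1, 0, 0, 1, 0, 1]
preds = {
    "encoder_a": [1, 0, 1, 0, 0, 1],  # hypothetical encoder name
    "encoder_b": [1, 1, 1, 0, 0, 1],  # hypothetical encoder name
}

shared, specific = failure_overlap(y_true, preds)
print("shared failures:", sorted(shared))
for name in preds:
    rate = mw_overprediction_rate(y_true, preds[name])
    print(name, "specific failures:", sorted(specific[name]), "over-prediction rate:", rate)
```

In this toy setup, samples that every encoder misclassifies point to cases that are hard regardless of representation, while encoder-specific failures and a higher over-prediction rate point to weaknesses of a particular feature extractor.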
Track: Proceeding Track
Submission Number: 26