Abstract: Dynamic Facial Expression Recognition (DFER) in the wild poses a significant challenge in emotion recognition research. Many studies have focused on extracting finer facial features while overlooking the effect of noisy frames on the entire sequence. In addition, the imbalance between short- and long-term temporal relationships remains inadequately addressed. To tackle these issues, we propose the Multi-Snippet Spatiotemporal Learning (MSSL) framework, which applies distinct temporal and spatial modeling for snippet feature extraction, enabling more accurate modeling of subtle facial expression changes while capturing finer details. We also introduce a dual-branch hierarchical module, BiTemporal Multi-Snippet Enhancement (BTMSE), designed to capture spatiotemporal dependencies and effectively model subtle visual changes across snippets. The Temporal-Transformer further enhances the learning of long-term dependencies, while learnable temporal position embeddings ensure consistency between snippet and fused features over time. By leveraging (2+1)D multi-snippet spatiotemporal modeling, BTMSE, and the Temporal-Transformer, MSSL hierarchically explores the complex interrelationships between temporal dynamics and facial expressions. Comparative experiments and ablation studies confirm the effectiveness of our method on three large-scale in-the-wild datasets: DFEW, FERV39K, and MAFW.
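The snippet-based (2+1)D idea in the abstract can be illustrated with a minimal sketch: partition a frame sequence into snippets, factorize each snippet's processing into a spatial step followed by a temporal step, and add temporal position embeddings before fusion. This is not the paper's implementation; the function names, snippet count, and the use of simple average pooling (in place of learned (2+1)D convolutions, BTMSE, and the Temporal-Transformer) are illustrative assumptions.

```python
import numpy as np

def split_into_snippets(frames, num_snippets):
    """Partition a clip of T frames (T, H, W, C) into equal-length snippets."""
    T = frames.shape[0]
    length = T // num_snippets  # assumes T is divisible by num_snippets
    return [frames[i * length:(i + 1) * length] for i in range(num_snippets)]

def spatial_then_temporal(snippet):
    """(2+1)D-style factorization, sketched with pooling:
    first a per-frame spatial step, then a temporal step over the snippet."""
    per_frame = snippet.mean(axis=(1, 2))  # spatial step -> (L, C)
    return per_frame.mean(axis=0)          # temporal step -> (C,)

# Toy clip: 16 frames of 4x4 RGB (stand-in for a face-crop sequence).
frames = np.random.rand(16, 4, 4, 3)
snippets = split_into_snippets(frames, num_snippets=4)
snippet_feats = np.stack([spatial_then_temporal(s) for s in snippets])  # (4, C)

# Temporal position embeddings (learned in the paper; zeros here as a placeholder)
# keep snippet features ordered consistently before long-term fusion.
pos_embed = np.zeros((len(snippets), snippet_feats.shape[1]))
fused_input = snippet_feats + pos_embed  # would feed a Temporal-Transformer
```

The key design point the sketch mirrors is that spatial and temporal modeling are decoupled per snippet, so frame-level noise is aggregated locally before any long-range temporal reasoning.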
External IDs: dblp:journals/ijon/LuZMZN25