Dedicated Encoding-Streams Based Spatio-Temporal Framework for Dynamic Person-Independent Facial Expression Recognition
Abstract: The facial expression recognition (FER) task is widely considered in modern human-machine platforms, such as human support robots and self-service systems. The attention given to FER is reflected in the variety of architectures and datasets proposed to develop efficient automatic FER frameworks. This paper proposes a new and efficient appearance-based deep framework for dynamic FER, referred to as Dedicated Encoding-Streams based Spatio-Temporal FER (DEST-FER). It takes four input frames, the last of which corresponds to the peak of the emotion, and encodes each frame through a dedicated CNN stream. The four streams are joined by LSTM units that perform the temporal processing and predict the dominant facial expression. We adopted the challenging person-independent FER protocol. To make DEST-FER more robust to this constraint, we preprocessed the input frames by highlighting 49 landmarks characterizing the emotion-related regions of interest and applying an edge-based filter. We evaluated 12 CNN architectures for the appearance-based encoders on three benchmarks. The ResNet18 model proved to be the best-performing encoder in combination with the LSTM units and achieved the top FER performance, outperforming state-of-the-art works.
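To make the described architecture concrete, the following is a minimal PyTorch sketch of the DEST-FER idea: four dedicated ResNet18 encoding streams (one per input frame) whose features are aggregated by an LSTM that predicts the dominant facial expression. The class name, feature size, hidden size, and number of expression classes are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the DEST-FER idea (assumed names and hyperparameters).
import torch
import torch.nn as nn
from torchvision.models import resnet18


class DESTFERSketch(nn.Module):
    def __init__(self, num_frames: int = 4, num_classes: int = 7,
                 lstm_hidden: int = 256):
        super().__init__()
        # One dedicated ResNet18 encoder per input frame (appearance streams).
        self.streams = nn.ModuleList()
        for _ in range(num_frames):
            backbone = resnet18(weights=None)
            backbone.fc = nn.Identity()  # keep the 512-d pooled features
            self.streams.append(backbone)
        # LSTM aggregates the per-frame features along the temporal axis.
        self.lstm = nn.LSTM(input_size=512, hidden_size=lstm_hidden,
                            batch_first=True)
        self.classifier = nn.Linear(lstm_hidden, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, 3, H, W), last frame = emotion peak
        feats = [stream(frames[:, t]) for t, stream in enumerate(self.streams)]
        seq = torch.stack(feats, dim=1)       # (batch, num_frames, 512)
        _, (h_n, _) = self.lstm(seq)          # final hidden state
        return self.classifier(h_n[-1])       # expression logits


if __name__ == "__main__":
    model = DESTFERSketch()
    dummy = torch.randn(2, 4, 3, 224, 224)    # two preprocessed frame sequences
    print(model(dummy).shape)                 # torch.Size([2, 7])
```

In this sketch the input frames are assumed to have already undergone the landmark-highlighting and edge-based filtering described above.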