Abstract: Recognising apparent emotion from audio-visual signals in naturalistic conditions remains an open problem. Existing methods, whether built on recurrent models or on self-attention over contextual dependencies at the feature level, fail to capture the long-term dependencies that subtly occur at different levels of abstraction. Affective Processes have emerged as a novel paradigm for modelling temporal dynamics through a probabilistic global latent variable that captures context and induces dependencies in the outputs, showing superior performance at low computational complexity. Despite their impressive results on visual data, Affective Processes remain unexplored for audio, a modality known to crucially influence the perception of emotions. In this paper, we first revisit and extend Affective Processes to the speech domain, identifying the key components and learning procedures required for their efficient training. We then extend Affective Processes to audio-visual affect recognition, using modality-specific context encoders. Finally, we propose a novel application of Affective Processes to Cooperative Machine Learning, propagating affect labels in videos from sparse human supervision. We conduct extensive ablation studies to identify the main components behind the success of Affective Processes, and compare against existing methods on a variety of datasets.