Abstract: In dyadic interactions, the ability to generate appropriate facial reactions is essential for conveying empathy and understanding. This paper introduces a novel framework that couples a diffusion model with a vector-quantized variational autoencoder (VQ-VAE) to synthesize contextually appropriate facial reactions. We evaluate our model on the IEEE FG REACT2024 dataset, where it outperforms the baseline methods. These results underscore the potential of our framework to improve the fidelity of digital human interactions, paving the way for more nuanced and emotionally intelligent systems.
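The abstract describes a two-stage design: a VQ-VAE that discretizes listener facial-reaction features into a latent codebook, and a diffusion model that generates in that latent space conditioned on the speaker's behaviour. The sketch below is a minimal illustration of that general pattern, not the authors' released implementation; all module names, dimensions, the 58-dimensional reaction features, and the simplified cosine noise schedule are assumptions made for the example.

```python
# Illustrative sketch only: a VQ-VAE over listener reaction features plus a
# conditional denoiser trained on its quantized latents. Shapes and dimensions
# are hypothetical and not taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator."""

    def __init__(self, num_codes=512, code_dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z_e):                      # z_e: (B, T, D)
        flat = z_e.reshape(-1, z_e.size(-1))     # (B*T, D)
        dist = torch.cdist(flat, self.codebook.weight)   # (B*T, K)
        idx = dist.argmin(dim=-1)
        z_q = self.codebook(idx).view_as(z_e)
        # Standard VQ-VAE codebook + commitment losses.
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()         # straight-through gradient
        return z_q, loss, idx


class ReactionVQVAE(nn.Module):
    """Encodes a listener reaction sequence into quantized latents and decodes it back."""

    def __init__(self, feat_dim=58, code_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, code_dim), nn.GELU(),
                                     nn.Linear(code_dim, code_dim))
        self.quantizer = VectorQuantizer(code_dim=code_dim)
        self.decoder = nn.Sequential(nn.Linear(code_dim, code_dim), nn.GELU(),
                                     nn.Linear(code_dim, feat_dim))

    def forward(self, reaction):                 # reaction: (B, T, feat_dim)
        z_q, vq_loss, _ = self.quantizer(self.encoder(reaction))
        return self.decoder(z_q), vq_loss


class LatentDenoiser(nn.Module):
    """Predicts the noise added to reaction latents, conditioned on speaker features."""

    def __init__(self, code_dim=128, cond_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(code_dim + cond_dim + 1, hidden), nn.GELU(),
                                 nn.Linear(hidden, code_dim))

    def forward(self, z_t, t, speaker_cond):     # z_t: (B, T, D), t: (B,)
        t_emb = t.float().view(-1, 1, 1).expand(-1, z_t.size(1), 1)
        return self.net(torch.cat([z_t, speaker_cond, t_emb], dim=-1))


def diffusion_training_step(vqvae, denoiser, reaction, speaker_cond, num_steps=1000):
    """One simplified DDPM-style training step on the quantized reaction latents."""
    with torch.no_grad():
        z_q, _, _ = vqvae.quantizer(vqvae.encoder(reaction))
    t = torch.randint(0, num_steps, (reaction.size(0),))
    alpha_bar = torch.cos(0.5 * torch.pi * t / num_steps).view(-1, 1, 1) ** 2
    noise = torch.randn_like(z_q)
    z_t = alpha_bar.sqrt() * z_q + (1 - alpha_bar).sqrt() * noise
    return F.mse_loss(denoiser(z_t, t, speaker_cond), noise)
```

Under these assumptions, inference would run the reverse diffusion process conditioned on the speaker's features and pass the resulting latents through the VQ-VAE decoder to obtain facial-reaction coefficients; the discrete codebook is what lets the diffusion prior generate diverse yet plausible reactions.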