PerFRDiff: Personalised Weight Editing for Multiple Appropriate Facial Reaction Generation

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Human facial reactions play crucial roles in dyadic human-human interactions, where individuals (i.e., listeners) with varying cognitive process styles may display different but appropriate facial reactions in response to an identical behaviour expressed by their conversational partners. While several existing facial reaction generation approaches are capable of generating multiple appropriate facial reactions (AFRs) in response to each given human behaviour, they fail to take the human listener's personalised cognitive process into account when generating AFRs. In this paper, we propose the first online personalised multiple appropriate facial reaction generation (MAFRG) approach, which learns a unique personalised cognitive style from the target human listener's previous facial behaviours and represents it as a set of network weight shifts. These personalised weight shifts are then applied to edit the weights of a pre-trained generic MAFRG model, allowing the resulting personalised model to naturally mimic the target human listener's cognitive process in its reasoning for generating multiple AFRs. Experimental results show that our approach not only substantially outperforms all existing approaches in generating more appropriate and diverse generic AFRs, but also serves as the first reliable personalised MAFRG solution. Our code is provided in the Supplementary Material.
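The weight-editing idea in the abstract can be illustrated with a minimal sketch: a hypothetical hypernetwork maps the listener's past facial behaviour into per-layer weight shifts, which are added to the weights of a frozen, pre-trained generic MAFRG model. All module names, dimensions, and the single-layer scope below are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch of personalised weight editing: W' = W + scale * ΔW,
# where ΔW is predicted from the listener's previous facial behaviour.
# All names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn


class WeightShiftGenerator(nn.Module):
    """Hypothetical module: encodes a summary of the listener's previous
    facial behaviour into a weight shift for one target linear layer."""

    def __init__(self, behaviour_dim: int, target_layer: nn.Linear):
        super().__init__()
        out_f, in_f = target_layer.weight.shape
        self.proj = nn.Sequential(
            nn.Linear(behaviour_dim, 256),
            nn.ReLU(),
            nn.Linear(256, out_f * in_f),
        )
        self.shape = (out_f, in_f)

    def forward(self, listener_history: torch.Tensor) -> torch.Tensor:
        # listener_history: (behaviour_dim,) summary of past facial behaviour
        return self.proj(listener_history).view(self.shape)


def personalise(generic_layer: nn.Linear, delta_w: torch.Tensor, scale: float = 1.0) -> nn.Linear:
    """Return a copy of the generic layer with edited weights W + scale * ΔW."""
    edited = nn.Linear(generic_layer.in_features, generic_layer.out_features,
                       bias=generic_layer.bias is not None)
    with torch.no_grad():
        edited.weight.copy_(generic_layer.weight + scale * delta_w)
        if generic_layer.bias is not None:
            edited.bias.copy_(generic_layer.bias)
    return edited


# Usage sketch: personalise one layer of a frozen generic model.
generic_layer = nn.Linear(128, 128)       # stands in for a layer of the generic MAFRG model
shift_gen = WeightShiftGenerator(64, generic_layer)
history = torch.randn(64)                 # placeholder listener-behaviour summary
personalised_layer = personalise(generic_layer, shift_gen(history))
```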
Primary Subject Area: [Engagement] Emotional and Social Signals
Relevance To Conference: Human facial reactions are influenced by both the audio and visual behaviours expressed by the corresponding conversational partners. However, the optimal way to use audio and visual behaviours for generating personalised and appropriate facial reactions remains underexplored. Our work contributes to the multimedia and multimodal processing domain by considering both audio and visual behaviours expressed by the human speaker to generate personalised facial reactions in human-to-human interactions, underscoring the multimodal nature of human communication. Inspired by the human visual and auditory systems, which work both separately and in combination with each other, we separately encode the audio and visual behaviours expressed by the conversational partner and then integrate them with a cross-attention mechanism. Our model not only achieves diverse, appropriate and realistic facial reaction generation but also simulates individuals' personalised cognitive processes to enable personalised facial reaction generation. This approach advances human-computer interaction technologies by mimicking human cognitive processes and integrating multimodal data in facial reaction generation. It represents a step forward in making digital communications more natural and intuitive, demonstrating the vital role of multimodal processing in enhancing the realism and responsiveness of digital environments.
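A minimal sketch of the cross-attention fusion described above, under assumed feature shapes and module names: the speaker's audio and visual behaviours are encoded separately and then combined, with the visual tokens attending to the audio tokens. The encoders, dimensions, and attention direction here are assumptions; the paper's actual architecture may differ.

```python
# Sketch of separate audio/visual encoding followed by cross-attention fusion.
# Feature dimensions (80-dim audio frames, 512-dim face frames) are assumed.
import torch
import torch.nn as nn


class AudioVisualFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Placeholder single-layer encoders standing in for the real audio/visual encoders.
        self.audio_encoder = nn.Linear(80, dim)
        self.visual_encoder = nn.Linear(512, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (B, T_audio, 80), visual: (B, T_visual, 512)
        a = self.audio_encoder(audio)
        v = self.visual_encoder(visual)
        # Visual tokens query the audio tokens; one fused token per video frame.
        fused, _ = self.cross_attn(query=v, key=a, value=a)
        return fused


fusion = AudioVisualFusion()
fused = fusion(torch.randn(2, 100, 80), torch.randn(2, 50, 512))  # -> (2, 50, 256)
```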
Supplementary Material: zip
Submission Number: 1776