Dynamic Visual Speaking Patterns: You Are the Way You Speak

Published: 28 May 2025, Last Modified: 28 Feb 2026OpenReview Archive Direct UploadEveryoneCC BY-SA 4.0
Abstract: In this paper, we present a new perspective on analyzing the unique visual patterns specific to each speaker in talking head videos. Traditional speaker recognition methods primarily rely on static facial appearance. In contrast, our method emphasizes dynamic visual patterns, which we denote as Dynamic Visual speaking Patterns (DVPs). By concentrating on dynamic patterns, our approach not only facilitates speaker recognition but also inherently resists fake samples, with the potential to accurately identify the original speaker even in manipulated videos. To guide the model in learning DVPs beyond superficial appearance, we introduce three seamlessly integrable improvements to existing speaker recognition works: (1) Input level: We introduce negative samples that retain the original static facial features but distort dynamic patterns intentionally. The model then learns to distinguish dynamic patterns across different speakers effectively With the strategy of contrastive learning. (2) Feature level: We convert the input video frames from the spatio-temporal domain to the frequencytemporal domain, facilitating an easy distinction between static and dynamic patterns. (3) Learning strategy: We incorporate a Gradient Reversal Layer (GRL) to mitigate the model’s reliance on static features, forcing it to focus on dynamic patterns. We finally validate our work comprehensively based on a simple speaker recognition framework. Experimental results show that our method not only excels in speaker recognition tasks but also inherently resists manipulation in forged video samples. Moreover, the dynamic visual patterns lead to a new challenge task in talking head video analysis: Identifying Original Speaker in forged videos (IOS). The advantages of our proposed DVPs over traditional static speaker representations highlight the potential of DVPs for robust speaker recognition in both real and manipulated videos.
Loading