SyncLipMAE: Contrastive Masked Pretraining for Audio–Visual Talking-Face Representations

06 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Lip Synchronization, Talking Face
Abstract: We introduce SyncLipMAE, a self-supervised pretraining framework for talking-face video that learns synchronization-aware and transferable facial dynamics from unlabeled audio–visual streams. Our approach couples masked image/video modeling with cross-modal contrastive alignment and employs three per-frame prompt tokens that explicitly encode the essential factors of a talking-face frame: identity, vocal motion (speech-synchronized facial dynamics), and ambient motion (audio-agnostic movements such as blinks and head pose). The contrastive objective uses time-aligned vocal-motion and audio tokens as positives and misaligned pairs as negatives, driving both modalities into a shared embedding space and yielding token-level audio–visual stream synchronization. After pretraining, the aligned audio tokens together with the visual prompt tokens (identity, vocal motion, ambient motion) form a unified interface for four disparate downstream settings: (i) audio–visual stream synchronization; (ii) facial expression and head/face action recognition; (iii) visual speech recognition; and (iv) visual dubbing, for which we, for the first time, enable indistinguishable audio- or video-driven control within a single model. Across four task families that require distinct capabilities, SyncLipMAE achieves state-of-the-art results, underscoring the effectiveness of synchronization-aware, factorized self-supervised pretraining.
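The contrastive objective described above (time-aligned vocal-motion/audio token pairs as positives, misaligned pairs within the clip as negatives) can be illustrated with a minimal InfoNCE-style sketch. This is an assumption-laden toy, not the paper's implementation: the function name, tensor shapes, and temperature value are all hypothetical.

```python
import torch
import torch.nn.functional as F

def sync_contrastive_loss(vocal_tokens, audio_tokens, temperature=0.07):
    """Toy InfoNCE-style alignment loss for one clip.

    vocal_tokens, audio_tokens: (T, D) per-frame embeddings.
    Frame t's vocal-motion token and audio token form the positive pair;
    every misaligned (t, t') pair in the clip serves as a negative.
    """
    v = F.normalize(vocal_tokens, dim=-1)  # unit-norm visual tokens
    a = F.normalize(audio_tokens, dim=-1)  # unit-norm audio tokens
    logits = v @ a.t() / temperature       # (T, T) cosine-similarity matrix
    targets = torch.arange(v.size(0))      # diagonal = time-aligned positives
    # Symmetric cross-entropy: visual->audio and audio->visual directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings for an 8-frame clip.
T, D = 8, 64
loss = sync_contrastive_loss(torch.randn(T, D), torch.randn(T, D))
```

Minimizing this loss pulls each frame's vocal-motion token toward its time-aligned audio token in the shared embedding space, which is what enables token-level synchronization scoring downstream.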
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 2551