Lost in Translation: Lip-Sync Deepfake Detection from Audio-Video Mismatch

Published: 01 Jan 2024, Last Modified: 04 Mar 2025. CVPR Workshops 2024. License: CC BY-SA 4.0
Abstract: Highly realistic voice cloning combined with AI-powered video manipulation allows for the creation of compelling lip-sync deepfakes in which anyone can be made to say things they never said. The resulting fakes are used to entertain, but also for everything from election-related disinformation to small- and large-scale fraud. Lip-sync deepfakes can be particularly difficult to detect because only the mouth and jaw of the person talking are modified. We describe a robust and general-purpose technique to detect these fakes. This technique begins by independently translating the audio (using audio-to-text transcription) and the video (using automated lip reading) into text. We then show that the resulting transcriptions are significantly more mismatched for lip-sync deepfakes than for authentic videos. The robustness of this technique is evaluated against a controlled dataset of our own creation and against in-the-wild fakes, all of varying length and resolution.
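To make the pipeline concrete, here is a minimal sketch of the mismatch test the abstract describes. It is not the authors' implementation: the ASR step uses OpenAI's open-source whisper package as a stand-in transcriber, lip_read_transcript is a placeholder for whatever automated lip-reading (visual speech recognition) model one plugs in, and the decision threshold is purely illustrative.

```python
# Sketch of the audio/video transcript-mismatch test for lip-sync deepfakes.
# Assumptions (not from the paper): whisper as the ASR model,
# lip_read_transcript as a placeholder VSR model, threshold=0.6 illustrative.
import whisper


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance, normalized by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def lip_read_transcript(video_path: str) -> str:
    """Placeholder: run an automated lip-reading (VSR) model on the video."""
    raise NotImplementedError("plug in a lip-reading model here")


def is_lip_sync_fake(video_path: str, threshold: float = 0.6) -> bool:
    """Flag a video when the audio and lip-read transcripts diverge."""
    # whisper extracts the audio track via ffmpeg, so a video path works.
    asr_text = whisper.load_model("base").transcribe(video_path)["text"]
    lip_text = lip_read_transcript(video_path)
    # Authentic videos yield similar transcripts; lip-sync fakes do not,
    # so a high word error rate between the two signals a likely fake.
    return word_error_rate(asr_text, lip_text) > threshold
```

The key design point is that neither transcript needs to be perfect: both models make errors on authentic video, but those errors are small compared with the divergence produced when the mouth region has been resynthesized to match different audio.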
