OpenFake: An Open Dataset and Platform Toward Real-World Deepfake Detection

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Deepfake Detection, Misinformation, Disinformation, Dataset Benchmark, Crowdsourcing, Generative AI, Synthetic Images
TL;DR: A politically grounded deepfake detection dataset with realistic synthetic images and a crowdsourced adversarial platform for adaptive detection.
Abstract: Deepfakes, synthetic media created using advanced AI techniques, pose a growing threat to information integrity, particularly in politically sensitive contexts. This challenge is amplified by the increasing realism of modern generative models, whose outputs our human perception study confirms are often indistinguishable from real images. Yet existing deepfake detection benchmarks rely on outdated generators or narrowly scoped datasets (e.g., single-face imagery), limiting their utility for real-world detection. To address these gaps, we present OpenFake, a large, politically grounded dataset specifically crafted for benchmarking against modern, highly realistic generative models, and designed to remain extensible through a crowdsourced adversarial platform that continually integrates new hard examples. OpenFake comprises nearly four million images: three million real images paired with descriptive captions and almost one million synthetic counterparts from state-of-the-art proprietary and open-source models. Detectors trained on OpenFake achieve near-perfect in-distribution performance, strong generalization to unseen generators, and high accuracy on a curated in-the-wild social media test set, significantly outperforming models trained on existing datasets. Overall, our results offer encouraging evidence that detectors trained with high-quality data can generalize to real-world social media distributions.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 21478