MSTformer: Multiscale Spatiotemporal Motion-aware Transformer Network for Effective AI-Generated Video Detection

ICLR 2026 Conference Submission19523 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: AI-generated video detection, Out-of-distribution generalization, Multiscale spatiotemporal modeling, Contrastive learning
Abstract: Recent AI-generated videos (e.g., from Veo3) are increasingly realistic and hard to distinguish from real videos. Existing detectors usually rely on artifacts present in earlier or lower-quality generators, resulting in poor generalization to newly released generators. To address this challenge, we propose a new dataset, AIDetection, for the AI-generated video detection task. AIDetection contains 39,298 real and 19,731 generated videos from 27 diverse sources and is specifically designed to evaluate cross-generator generalization under out-of-distribution settings. In real videos, the motion of moving objects and that of the background are clearly distinct. Based on this observation, we introduce a novel Multiscale Spatiotemporal motion modeling Transformer framework (MSTformer) for AI-generated video detection, which learns motion-aware discriminative representations from both local and global viewpoints. Specifically, a multiscale spatiotemporal downsampling mechanism is designed to capture local motion discrepancies between real and generated videos. Furthermore, to prevent the discriminative cues from being weakened, we apply contrastive learning to the multiscale spatiotemporal features, enabling the model to maintain its global discriminative ability. Extensive experiments on three benchmark datasets (i.e., AIDetection, GVF, and GenVideo) demonstrate that MSTformer achieves superior cross-domain generalization performance. In addition, ablation studies confirm the effectiveness of multiscale temporal modeling and contrastive learning in enhancing robustness for AI-generated video detection.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 19523