Submission Type: Short Papers (up to 4 pages)
Keywords: Video Generation, Video Understanding, AIGC Detection
TL;DR: We present the first ASMR-focused benchmark posing new challenges for frontier video generation and understanding.
Abstract: Recent video generation models can produce increasingly realistic videos, making synthetic content harder to distinguish from real footage. Existing benchmarks mainly emphasize semantic alignment or coarse physical plausibility, offering limited sensitivity to subtle realism failures. We present VideoASMR-Bench, a preliminary benchmark based on ASMR videos, a domain that naturally requires fine-grained audio–visual synchronization, material realism, and sensory consistency. The benchmark contains 1,500 real ASMR clips curated from social media and 2,235 synthetic counterparts generated by contemporary video generation models. Using a binary real-versus-fake judgment task, we conduct an initial evaluation of representative video-language models (VLMs) and human annotators. Our early findings show that even strong proprietary VLMs still lag behind humans in detecting AI-generated ASMR videos, while audio cues provide clear gains in authenticity judgment. These results suggest that ASMR is a sensitive and underexplored testbed for evaluating both video realism and multimodal video understanding.
Submission Number: 6