AdDriftBench: A Benchmark for Detecting Data Drift and Label Drift in Short Video Advertising

ACL ARR 2025 May Submission778 Authors

15 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0

Abstract: With the commercialization of short video platforms (SVPs), the demand for compliance auditing of advertising content has grown rapidly. The rise of large vision-language models (VLMs) offers new opportunities for automating ad content moderation. However, short video advertising presents unique challenges due to $\textbf{data drift (DD)}$ and $\textbf{label drift (LD)}$. DD refers to rapid shifts in the data distribution that occur as advertisers alter their content to evade platform review mechanisms. LD arises from the evolving and increasingly standardized review guidelines of SVPs, which effectively move the classification boundaries over time. Despite the significance of these phenomena, benchmarks designed to evaluate model performance under such conditions are currently lacking. To address this gap, we propose $\textbf{AdDriftBench (ADB)}$. The ADB dataset consists of 3,480 short video ads. Of these, 2,280 examples are labeled under data drift scenarios and are designed to evaluate the generalization capabilities of VLMs under rapidly shifting content distributions. The remaining 1,200 examples represent label drift scenarios and are aimed at assessing VLMs’ instruction following and fine-grained semantic understanding under varying auditing standards. Through extensive experiments on 16 open-source VLMs, we find that current models achieve only moderate performance in short video advertising contexts, with particular weaknesses in handling fine-grained semantics and adapting to shifting instructions. Our dataset will be made publicly available.

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: benchmarking, evaluation methodologies, evaluation, metrics, reproducibility, statistical testing for evaluation

Contribution Types: Data resources, Data analysis

Languages Studied: Chinese, English

Submission Number: 778
