Trash to Treasure: Paving a New Way for Improving Video Understanding via Counterfactual Video Generation
Keywords: multimodal, video understanding, dataset and benchmark
Abstract: Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to "visually ungrounded hallucinations," especially when processing counterfactual videos that defy common sense. This limitation stems from the intrinsic data imbalance between text and video and is difficult to remedy because generating and annotating counterfactual data is costly. To overcome this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured context throughout the editing and QA generation pipeline, it not only automates the production of high-quality QA pairs but also inherently yields the paired data (original vs. edited) essential for contrastive training.
Based on this, we build DualityVidQA, a large-scale video QA dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime whose RL phase applies $\ell_1$ normalization to the advantages of each real-counterfactual pair, enabling more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0\% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains on both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.
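To make the pair-wise advantage normalization concrete, below is a minimal sketch, assuming the $\ell_1$ normalization divides each real-counterfactual pair's advantages by the sum of their absolute values; the function name, tensor layout, and epsilon term are illustrative assumptions, not details confirmed by the paper.

import torch

def dna_normalize_advantages(adv_real, adv_cf, eps=1e-8):
    # Hypothetical sketch of pair-wise l1-normalized advantages.
    # adv_real, adv_cf: tensors of shape (batch,), the raw advantages for
    # rollouts on the original and counterfactual video of each pair.
    # Each pair's advantages are divided by the pair's l1 norm so that no
    # single pair dominates the policy-gradient update (assumed behavior).
    l1 = adv_real.abs() + adv_cf.abs() + eps  # per-pair l1 norm
    return adv_real / l1, adv_cf / l1

# Illustrative usage with random advantages for a batch of 4 pairs.
if __name__ == "__main__":
    a_r = torch.randn(4)
    a_c = torch.randn(4)
    n_r, n_c = dna_normalize_advantages(a_r, a_c)
    print(n_r.abs() + n_c.abs())  # approximately 1 for every pair

Under this reading, the normalization keeps the magnitude of each pair's update bounded, which is consistent with the stated goal of more stable policy optimization.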
Primary Area: datasets and benchmarks
Submission Number: 6752