[AML] CleveReward: Contrastive Learning-Based Video Reward Training on Different Benchmark Datasets

THU 2024 Winter AML Submission 30 Authors

11 Dec 2024 (modified: 02 Mar 2025) · THU 2024 Winter AML Submission · CC BY 4.0
Keywords: Text-to-video generation models, Human preference alignment, Reward model, Alignment scoring
Abstract: With the rapid development of text-to-video (T2V) generation models, the ability to synthesize high-quality videos from textual descriptions has improved significantly. Yet even as visual quality continues to rise, consistency between the generated video and the text remains a major challenge: inconsistencies and hallucinations between video and text degrade output quality. Accurately assessing the instruction-following capability of video generation models is therefore crucial. Given the high cost and limited scalability of manual evaluation, this paper proposes an end-to-end reward model that automates the evaluation of instruction following in video generation models.

A key obstacle in training video understanding models and reward models is data insufficiency: video scoring datasets differ in labeling standards and scoring dimensions, which complicates dataset integration. To address this problem, we introduce CleveReward, a contrastive-learning-based video reward model that converts video scoring datasets such as T2VQA and VideoFeedback into pairwise preference formats and trains on them with a contrastive objective. Experimental results demonstrate that CleveReward trains effectively across datasets and has the potential to surpass current state-of-the-art video reward models. Furthermore, we introduce VideoCross, an open-source dataset designed to support contrastive learning. VideoCross integrates data annotated under different standards, reduces redundancy, and enhances consistency, providing high-quality data support for model training. By constructing positive and negative sample pairs, VideoCross helps the model learn the alignment between text and video, thereby improving the instruction-following performance of video generation models.

For training, we employed two advanced video understanding models, Qwen2-VL-7B-Chat and CogVLM2-Video, accelerated on 16 A800 GPUs. This research offers new insights into the evaluation and optimization of text-to-video generation models and advances the development of reward models and datasets to better support the application and progress of video generation technology.
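To make the two core steps concrete, the sketch below illustrates (1) converting an absolute-score dataset such as T2VQA or VideoFeedback into (preferred, rejected) pairs and (2) a pairwise contrastive (Bradley-Terry style) training loss. This is a minimal illustration under stated assumptions, not the authors' released implementation: the names `score_records`, `reward_model`, and `margin` are hypothetical, and `reward_model` is assumed to map a (prompt, video) pair to a scalar reward on top of a video-LLM backbone such as Qwen2-VL or CogVLM2-Video.

```python
# Minimal sketch (assumed, not the paper's code) of pairwise conversion
# and contrastive reward training as described in the abstract.
from itertools import combinations

import torch.nn.functional as F


def scores_to_pairs(score_records, margin=0.5):
    """Turn per-video absolute scores into (preferred, rejected) pairs.

    score_records: list of dicts with keys "prompt", "video", "score",
    where videos sharing a prompt were rated on the same scale. Pairs
    whose score gap is below `margin` are dropped, filtering noisy
    near-ties and softening cross-dataset labeling inconsistencies.
    """
    by_prompt = {}
    for rec in score_records:
        by_prompt.setdefault(rec["prompt"], []).append(rec)

    pairs = []
    for prompt, recs in by_prompt.items():
        for a, b in combinations(recs, 2):
            if abs(a["score"] - b["score"]) < margin:
                continue  # skip ambiguous pairs
            winner, loser = (a, b) if a["score"] > b["score"] else (b, a)
            pairs.append((prompt, winner["video"], loser["video"]))
    return pairs


def pairwise_contrastive_loss(reward_model, prompt, video_pos, video_neg):
    """Bradley-Terry loss: push r(prompt, v+) above r(prompt, v-)."""
    r_pos = reward_model(prompt, video_pos)  # scalar reward per example
    r_neg = reward_model(prompt, video_neg)
    # -log sigmoid(r+ - r-) is minimized when the preferred video
    # receives the higher reward.
    return -F.logsigmoid(r_pos - r_neg).mean()
```

Because only the relative ordering of videos under the same prompt is used, datasets scored on different scales can be merged without calibrating their absolute scores, which is the motivation for the pairwise conversion.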
Submission Number: 30