VideoAlign: A Comprehensive Model for Evaluating Alignment Between Text and Generated Videos

20 Oct 2024 (modified: 05 Nov 2024), THU 2024 Fall AML Submission, CC BY 4.0
Keywords: Video Understanding, Reward Model, Multi-modal Learning
Abstract: Text-to-video generation models have made significant progress recently, but aligning their outputs with human preferences remains challenging. Generated videos are frequently inconsistent with their corresponding textual descriptions, and manual evaluation is both labor-intensive and expensive. To address these alignment issues, we introduce VideoAlign, an end-to-end reward model designed to automatically evaluate the instruction-following capabilities of video generation models.
Submission Number: 17