Keywords: Video Understanding, Reward Model, Multi-modal Learning
Abstract: Text-to-video generation models have made significant progress recently, but aligning their outputs with human preferences remains challenging. Generated videos are frequently inconsistent with their textual descriptions, and manual evaluation is both labor-intensive and expensive. This study proposes a comprehensive solution to these alignment issues. We introduce VideoAlign, an end-to-end reward model designed to automatically evaluate the instruction-following capabilities of video generation models.
Submission Number: 17