GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
TL;DR: We introduce GRADEO-Instruct and GRADEO, a dataset and model for evaluating AI-generated videos through multi-step reasoning, aligning better with human evaluations.
Abstract: Recent advances in video generation models have demonstrated their potential to produce high-quality videos, which makes effective evaluation challenging. Unlike human evaluation, existing automated evaluation metrics lack high-level semantic understanding and reasoning capabilities for video, which makes them unreliable and hard to explain. To fill this gap, we curate **GRADEO-Instruct**, a multi-dimensional T2V evaluation instruction tuning dataset that includes 3.3k videos from over 10 existing video generation models and multi-step reasoning assessments converted from 16k human annotations. We then introduce **GRADEO**, one of the first specifically designed video evaluation models, which **grades** AI-generated **videos**, producing explainable scores and assessments through multi-step reasoning. Experiments show that our method aligns better with human evaluations than existing methods. Furthermore, our benchmarking reveals that current video generation models struggle to produce content that aligns with human reasoning and complex real-world scenarios. The models, datasets, and code will be released soon.
Lay Summary: Recent AI tools can create impressive videos, but it’s hard to tell how good these videos really are. Current automatic methods to judge video quality don’t understand the story or details in the video like humans do, so their ratings aren’t very reliable or easy to explain. To fix this, we gathered thousands of AI-made videos and asked many people to give detailed feedback on them, focusing on different aspects of video quality. Using this feedback, we built a new AI system called GRADEO that can score videos in a way that matches human opinions and explains why it gave that score. Our tests show that GRADEO judges videos better than other automatic methods. We also found that today’s AI video makers still have trouble creating videos that make sense in real life.
Primary Area: Deep Learning
Keywords: Text-to-Video Generation, Evaluation, MLLMs
Submission Number: 4591