MantisScore: A Reliable Fine-grained Metric for Video Generation

ACL ARR 2024 June Submission3711 Authors

16 Jun 2024 (modified: 02 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Recent years have witnessed great advances in text-to-video generation. However, video evaluation metrics have lagged significantly behind and fail to provide an accurate and holistic measure of generated video quality. The main barrier is the lack of high-quality human rating data. In this paper, we release VideoEval, the first large-scale multi-aspect video evaluation dataset. VideoEval consists of high-quality human-provided ratings across 5 video evaluation aspects for 37.6K videos generated by 11 existing popular video generative models. We train MantisScore on VideoEval to enable automatic video quality assessment. Experiments show that the Spearman correlation between MantisScore and human ratings reaches 77.12 on VideoEval-test, beating the prior best metrics by about 50 points. Further results on the held-out EvalCrafter, GenAI-Bench, and VBench benchmarks show that MantisScore is highly generalizable and still beats the prior best metrics by a remarkable margin. We observe that using Mantis as the base model consistently outperforms using Idefics2 or VideoLLaVA, and that the regression-based variant achieves better results than the generative one. Due to its high reliability, we believe MantisScore can serve as a valuable tool to accelerate video generation research.
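The abstract reports agreement with human raters via Spearman correlation. The following is a minimal, hypothetical sketch of how such a correlation between model-predicted scores and human ratings could be computed; the variable names and example values are assumptions for illustration, not the authors' evaluation code.

```python
# Hypothetical sketch (not the authors' code): Spearman correlation between
# model-predicted video quality scores and human ratings on one aspect.
from scipy.stats import spearmanr

# Placeholder scores for a few generated videos (assumed 1-4 rating scale).
model_scores = [3.2, 1.8, 2.5, 3.9, 1.1]
human_ratings = [3.0, 2.0, 2.0, 4.0, 1.0]

rho, p_value = spearmanr(model_scores, human_ratings)
print(f"Spearman correlation: {rho:.4f} (p={p_value:.4f})")
```

A higher correlation indicates that the metric ranks videos more consistently with human judgment, which is the sense in which the reported 77.12 on VideoEval-test is compared against prior metrics.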
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Multimodal Evaluation; Multimodality and Language Grounding to Vision
Contribution Types: NLP engineering experiment, Reproduction study, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 3711