Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM 2024 Oral · CC BY 4.0
Abstract: With the rapid development of generative models, AI-Generated Content (AIGC) has grown exponentially in daily life. Among its forms, Text-to-Video (T2V) generation has received widespread attention. Although many T2V models have been released that generate videos of high perceptual quality, there is still a lack of methods for evaluating the quality of these videos quantitatively. To address this issue, we establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date. The dataset consists of 10,000 videos generated by 9 different T2V models, each annotated with a mean opinion score. Based on T2VQA-DB, we propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA). The model extracts features from the perspectives of text-video alignment and video fidelity, then leverages a large language model to predict the quality score. Experimental results show that T2VQA outperforms existing T2V metrics and state-of-the-art video quality assessment models. Quantitative analysis indicates that T2VQA gives predictions aligned with subjective judgments, validating its effectiveness. The dataset and code will be released upon publication.
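The abstract describes a two-branch design: features from a text-video alignment perspective and a video fidelity perspective are extracted, fused, and mapped to a quality score. The sketch below is a minimal, hypothetical PyTorch illustration of such a pipeline, not the authors' T2VQA implementation; all module names, dimensions, and the simple regression head (standing in for the paper's LLM-based scorer) are assumptions for illustration only.

```python
# Hypothetical sketch (not the authors' code) of a two-branch T2V quality predictor:
# one branch carries text-video alignment features, the other video fidelity features,
# a small transformer fuses both, and a linear head regresses a quality score.
import torch
import torch.nn as nn

class T2VQualitySketch(nn.Module):
    def __init__(self, feat_dim: int = 768, num_fusion_layers: int = 2):
        super().__init__()
        # Placeholder projections; in practice these would sit on top of
        # pretrained alignment and fidelity backbones.
        self.align_proj = nn.Linear(feat_dim, feat_dim)
        self.fidelity_proj = nn.Linear(feat_dim, feat_dim)
        fusion_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True
        )
        self.fusion = nn.TransformerEncoder(fusion_layer, num_fusion_layers)
        # Simple regression head standing in for the LLM-based scorer in the paper.
        self.score_head = nn.Linear(feat_dim, 1)

    def forward(self, align_feat: torch.Tensor, fidelity_feat: torch.Tensor):
        # align_feat / fidelity_feat: (batch, tokens, feat_dim) features from
        # the alignment and fidelity branches, respectively.
        tokens = torch.cat(
            [self.align_proj(align_feat), self.fidelity_proj(fidelity_feat)], dim=1
        )
        fused = self.fusion(tokens)
        # Mean-pool the fused tokens, then map to a scalar quality score.
        return self.score_head(fused.mean(dim=1)).squeeze(-1)

if __name__ == "__main__":
    model = T2VQualitySketch()
    a = torch.randn(2, 16, 768)   # dummy alignment features
    f = torch.randn(2, 16, 768)   # dummy fidelity features
    print(model(a, f).shape)      # torch.Size([2])
```

Such a score head would typically be trained to regress the mean opinion scores collected in T2VQA-DB; the actual fusion and scoring mechanism used by T2VQA may differ.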
Primary Subject Area: [Experience] Interactions and Quality of Experience
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Generated videos, especially text-generated videos, have become an important component of multimedia. This work targets understanding and measuring human quality of experience when viewing text-generated videos. To that end, we propose a text-to-video dataset and a novel model for the quality assessment of text-generated videos. The dataset contains a large number of generated videos, along with mean opinion scores from multiple subjects, which capture real human interaction with generated videos and support the design of new metrics. We then propose a multimodal model that embeds and fuses information from video and text simultaneously to give subjectively aligned predictions of the quality of text-generated videos. In conclusion, this work is highly relevant to the theme of the conference.
Supplementary Material: zip
Submission Number: 678