CLiF-VQA: Enhancing Video Quality Assessment by Incorporating High-Level Semantic Information related to Human Feelings

Published: 20 Jul 2024, Last Modified: 04 Aug 2024 | MM 2024 Poster | CC BY 4.0
Abstract: Video Quality Assessment (VQA) aims to simulate the process by which the Human Visual System (HVS) perceives video quality. Although subjective studies have shown that HVS judgments are strongly influenced by human feelings, it remains unclear how video content relates to human feelings. The recent rapid development of vision-language pre-trained models (VLMs) has established a solid link between language and vision, and human feelings can be accurately described by language, which means that a VLM can extract information related to human feelings from visual content given linguistic prompts. In this paper, we propose CLiF-VQA, which leverages the visual-linguistic capabilities of VLMs to introduce human-feelings features on top of traditional spatio-temporal features and thereby more accurately simulate the perceptual process of the HVS. To efficiently extract features related to human feelings from videos, we pioneer the exploration of the consistency between Contrastive Language-Image Pre-training (CLIP) and human feelings in video perception. In addition, we design effective prompts, i.e., a variety of objective and subjective descriptions closely related to human feelings. Extensive experiments show that the proposed CLiF-VQA achieves excellent performance on several VQA datasets, demonstrating that introducing human-feelings features on top of spatio-temporal features is an effective way to obtain better performance.
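To make the prompting idea concrete, below is a minimal sketch (not the authors' code) of how CLIP can score sampled video frames against feeling-related text descriptions; the prompt wording, the ViT-B/32 checkpoint, and the frame file name are illustrative assumptions, and the resulting similarity scores would then be pooled over frames and fused with spatio-temporal quality features.

import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical objective/subjective descriptions related to human feelings.
feeling_prompts = [
    "a pleasant video that makes people feel comfortable",
    "an annoying video that makes people feel uncomfortable",
    "a clear and vivid scene",
    "a blurry and distorted scene",
]
text_tokens = clip.tokenize(feeling_prompts).to(device)

# Frames would normally be sampled from the video; a single frame stands in here.
frame = preprocess(Image.open("frame_000.png")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(frame)
    text_feat = model.encode_text(text_tokens)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    # Cosine similarity between the frame and each feeling prompt.
    scores = (image_feat @ text_feat.T).squeeze(0)

print(dict(zip(feeling_prompts, scores.tolist())))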
Primary Subject Area: [Experience] Interactions and Quality of Experience
Secondary Subject Area: [Content] Vision and Language, [Content] Multimodal Fusion
Relevance To Conference: This paper introduces language into the task of video quality assessment, using Multimodal Large Language Models (MLLMs) under linguistic prompts to obtain features of a video that are related to human feelings; it therefore falls under both the Vision and Language and the Multimodal Fusion themes. Subjective studies have shown that the judgment of the human visual system can be influenced by human feelings, and human feelings can be accurately described through language. In recent years, the rapid development of MLLMs has established a strong link between language and vision, so MLLMs can be used to extract information related to human feelings from visual content given linguistic prompts. To effectively extract such features from videos, we verify that CLIP (Contrastive Language-Image Pre-training) is highly consistent with human feelings. Further, we introduce human-feelings features on top of the traditional spatio-temporal features to better model the perception process of the human visual system.
Supplementary Material: zip
Submission Number: 949