ZE-FESG: A Zero-Shot Feature Extraction Method Based on Semantic Guidance for No-Reference Video Quality Assessment
Abstract: Although current deep neural network based no-reference video quality assessment (NR-VQA) methods can effectively simulate the human visual system (HVS), their interpretability remains poor. Existing methods extract only low-level spatial and temporal features from a video and neglect the impact of high-level semantics. However, the high-level semantic information in a video, which relates both to human subjective perception and to the video's own quality, is also perceived by the HVS. In this work, we design a multidimensional feature extractor (MDFE) that takes text descriptions of video quality factors as semantic guidance and uses the Contrastive Language-Image Pre-training (CLIP) model to perform zero-shot multidimensional feature extraction. We then propose a zero-shot feature extraction method based on semantic guidance (ZE-FESG), which treats the MDFE as a feature extractor and acquires all the semantically corresponding features of a video by sliding the extractor over each frame. Extensive experiments show that the proposed ZE-FESG offers better interpretability and performance than current mainstream 2D-CNN based feature extraction methods for NR-VQA. The code will be released at https://github.com/xiao-mi-d/ZE-FESG.
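To make the abstract's mechanism concrete, the sketch below illustrates the general idea of CLIP-guided zero-shot frame feature extraction: quality-related text prompts are embedded once, and each video frame's similarity to every prompt becomes one dimension of its feature vector. This is a minimal sketch assuming the HuggingFace CLIP implementation; the prompt texts and function names are illustrative stand-ins, not the paper's actual quality descriptions or released code.

```python
# Minimal sketch of CLIP-guided zero-shot frame feature extraction,
# assuming HuggingFace's CLIP wrappers; prompts below are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical text descriptions of video quality factors (semantic guidance).
quality_prompts = [
    "a sharp, high quality video frame",
    "a blurry video frame",
    "a video frame with compression artifacts",
    "a noisy, low quality video frame",
]

@torch.no_grad()
def extract_frame_features(frames: list[Image.Image]) -> torch.Tensor:
    """Return one feature vector per frame: cosine similarities between the
    frame embedding and each quality-prompt embedding (zero-shot, no tuning)."""
    text_inputs = processor(text=quality_prompts, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    feats = []
    for frame in frames:  # slide over each frame of the video
        image_inputs = processor(images=frame, return_tensors="pt")
        img_emb = model.get_image_features(**image_inputs)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        feats.append((img_emb @ text_emb.T).squeeze(0))  # one score per prompt
    return torch.stack(feats)  # shape: (num_frames, num_prompts)
```

The resulting (num_frames, num_prompts) feature matrix could then be pooled or fed to a quality regressor; the paper's actual prompt set, pooling, and regression head are not specified in the abstract.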