Integrating Spatiotemporal Visual Stimuli for Video Quality Assessment

Published: 01 Jan 2024 · Last Modified: 08 Apr 2025 · IEEE Trans. Broadcast. 2024 · CC BY-SA 4.0
Abstract: While feature extraction with pre-trained models is effective and efficient for no-reference video tasks, it does not adequately account for the intricacies of the Human Visual System (HVS). In this study, we propose a novel approach, Integration of Spatio-temporal Visual Stimuli into Video Quality Assessment (IVS-VQA), for the first time. By exploiting the heightened sensitivity of rod cells to edges and motion, together with the ability to track motion through conjugate eye movements, our approach offers a distinctive perspective on video quality assessment. To capture significant changes at each timestamp, we incorporate edge information to enhance the features extracted by the pre-trained model. To handle pronounced motion along the timeline, we introduce an interactive temporal-disparity query built on a dual-branch transformer architecture, which injects feature biases and captures comprehensive global attention, thereby placing greater emphasis on non-continuous segments of the video. Additionally, we integrate low-level color and texture information in the temporal domain to capture distortions at both higher and lower scales. Experiments show that the proposed model achieves state-of-the-art performance on all six benchmark databases, as well as on their corresponding weighted averages.
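The abstract outlines two architectural ideas: edge information fused into pre-trained spatial features, and a dual-branch transformer in which temporal differences query the frame features. The paper's exact design is not given here, so the following is only a minimal PyTorch sketch under explicit assumptions: a frozen ResNet-50 backbone stands in for the pre-trained extractor, Sobel maps stand in for the edge information, and standard transformer-encoder layers plus cross-attention stand in for the dual-branch temporal-disparity query. All module names, dimensions, and hyperparameters are illustrative, not the authors' implementation.

```python
# Illustrative sketch only; not the IVS-VQA reference code.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class EdgeEnhancedBackbone(nn.Module):
    """Fuses Sobel edge maps with frozen pre-trained spatial features per frame."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # -> B x 2048 x h x w
        for p in self.features.parameters():
            p.requires_grad = False                                      # keep the backbone frozen
        kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("sobel", torch.stack([kx, kx.t()]).unsqueeze(1))  # 2 x 1 x 3 x 3
        self.fuse = nn.Conv2d(2048 + 2, 512, kernel_size=1)              # edge-aware fusion

    def forward(self, frames):                                           # frames: B x 3 x H x W
        gray = frames.mean(dim=1, keepdim=True)
        edges = F.conv2d(gray, self.sobel, padding=1)                    # horizontal/vertical gradients
        feat = self.features(frames)
        edges = F.interpolate(edges, size=feat.shape[-2:], mode="bilinear",
                              align_corners=False)
        return self.fuse(torch.cat([feat, edges], dim=1))                # B x 512 x h x w


class TemporalDisparityQuery(nn.Module):
    """Dual-branch encoder: one branch encodes frame features, the other their
    temporal differences; cross-attention lets the differences query the frames."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.frame_branch = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.diff_branch = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)

    def forward(self, clip_feats):                                       # clip_feats: B x T x dim
        diffs = clip_feats[:, 1:] - clip_feats[:, :-1]                   # frame-to-frame disparities
        frames = self.frame_branch(clip_feats)
        diffs = self.diff_branch(diffs)
        attended, _ = self.cross(diffs, frames, frames)                  # differences query the frames
        return self.head(attended.mean(dim=1)).squeeze(-1)               # one quality score per clip


if __name__ == "__main__":
    B, T = 2, 8
    frames = torch.rand(B * T, 3, 224, 224)
    spatial = EdgeEnhancedBackbone()(frames)                             # (B*T) x 512 x h x w
    clip = spatial.mean(dim=(-2, -1)).view(B, T, -1)                     # pool to B x T x 512
    print(TemporalDisparityQuery()(clip).shape)                          # torch.Size([2])
```

The design choice mirrored here is that quality cues are strongest where frames change abruptly, so the temporal-difference branch acts as the query while the frame branch supplies keys and values; how the actual model forms its queries, biases, and the low-level color-texture stream is described in the paper itself.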