Abstract: The design of image and video quality assessment (QA) algorithms is extremely important for benchmarking and calibrating user experience in modern visual systems. A major drawback of state-of-the-art QA methods is their limited ability to generalize across diverse image and video data under reasonable distribution shifts. In this work, we leverage the denoising process of diffusion models for generalized image QA (IQA) and video QA (VQA) by measuring the degree of alignment between learnable quality-aware text prompts and images or video frames. In particular, we learn cross-attention maps from intermediate layers of the denoiser of latent diffusion models (LDMs) to capture quality-aware representations of images or video frames. Since applying text-to-image LDMs to every video frame is computationally expensive, we estimate the quality of only a frame-rate subsampled version of the original video. To compensate for the loss of motion information due to frame-rate subsampling, we propose a novel temporal quality modulator. Our extensive cross-database experiments across various user-generated, synthetic, low-light, frame-rate-varied, ultra-high-definition, and streaming content-based databases show that our model achieves superior generalization in both IQA and VQA.
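To make the prompt-image alignment idea in the abstract concrete, here is a minimal sketch of scoring quality as cross-attention alignment between an antonym prompt pair and image features. All names, dimensions, and the random feature stand-ins are hypothetical; in the actual method, the image features would come from intermediate denoiser layers of an LDM and the prompt embeddings would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 16          # shared embedding dimension (assumed)
n_patches = 64  # spatial tokens from an intermediate denoiser layer (assumed)

# Hypothetical stand-ins: in the paper these would be LDM denoiser features
# and learned antonym ("good"/"bad" quality) text-prompt embeddings.
image_feats = rng.standard_normal((n_patches, d))
prompt_good = rng.standard_normal(d)
prompt_bad = rng.standard_normal(d)

def cross_attention_alignment(feats, prompt):
    """Scaled dot-product attention of a text prompt (query) over image
    features (keys); returns the attention-weighted mean alignment score."""
    scores = feats @ prompt / np.sqrt(feats.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return float(weights @ scores)

a_good = cross_attention_alignment(image_feats, prompt_good)
a_bad = cross_attention_alignment(image_feats, prompt_bad)

# A softmax over the antonym pair maps the two alignments to a
# quality probability in [0, 1].
quality = np.exp(a_good) / (np.exp(a_good) + np.exp(a_bad))
print(f"quality = {quality:.3f}")
```

This illustrates only the alignment-scoring step; the learnable contextual prompts and the temporal quality modulator for videos are beyond this sketch.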
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We thank the reviewers for their detailed comments on our paper. We have made the following updates to the revised
manuscript and also provide, in the supplementary material, an updated manuscript highlighting the changes:
1. Provided a description of the various categories of quality assessment algorithms, viz. full-reference,
reduced-reference, and no-reference, in the introduction. (This is in response to Reviewer 8hmB Q1)
2. Explained challenges with applying latent diffusion models for the purpose of quality assessment in
the introduction. (This is in response to Reviewer EQgr Q1)
3. Explained how our contributions enable the effective application of diffusion models for IQA and
VQA in the introduction. Our contributions allow one to leverage the generalization capabilities of
diffusion models for QA. (This is in response to Reviewer EQgr Q3 and Reviewer iFTE Q1)
4. Explained the motivation behind the use of cross-attention maps in the design of GenzIQA and
GenzVQA in Sec. 3.2. (This is in response to Reviewer iFTE Q1 part 1)
5. Explained the motivation for contextual prompt tuning and use of antonym prompt pairs in Sec 3.2.
(This is in response to Reviewer iFTE Q1 part 2 and Reviewer 8hmB Q3)
6. Updated the manuscript with the training loss function for GenzIQA (Eq. 5) and GenzVQA (Eq.
10). (This is in response to Reviewer 8hmB Q4)
7. Provided a detailed caption and updated Fig. 2, presenting a clearer understanding of the temporal
quality modulator. (This is in response to Reviewer iFTE Q2 and Reviewer 8hmB Q4)
8. Added a detailed description of various aspects of the SlowFast network used in TQM in Sec. 3.3.
Also, provided an intuitive description of the integration of SlowFast networks with the diffusion
model in the same section. (This is in response to Reviewer iFTE Q2 and Reviewer iFTE Q4)
9. Added the inference-time timestep details in Sec. 4.1. (This is in response to Reviewer iFTE Q3)
10. Provided a comparison of VQA models’ parameters in Tab. 12. (This is in response to Reviewer
EQgr Q7.)
Assigned Action Editor: ~Yu-Xiong_Wang1
Submission Number: 4964