Abstract: We introduce an enhanced spatial perception module, as shown in Fig. 1, pre-trained on multiple image quality assessment datasets, together with a lightweight temporal fusion module to address the no-reference video quality assessment (NR-VQA) task. The model employs Swin Transformer V2 [1] as a local-level spatial feature extractor and fuses the resulting multi-scale features to enhance quality-aware information. A temporal transformer is then used for spatiotemporal feature fusion. To accommodate compressed videos of varying bitrates, we adopt a coarse-to-fine contrastive strategy: a group contrast loss provides coarse discrimination among different bitrates, while a rank loss operates at a fine-grained level to strengthen the model's ability to distinguish different quality levels.
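The abstract does not give the exact formulation of the coarse-to-fine objective, so the following is only a minimal sketch of one plausible instantiation: a supervised-contrastive-style group contrast term over bitrate groups (coarse) plus a pairwise margin rank loss on predicted scores (fine). All function names, the margin/temperature parameters, the loss weights, and the L1 regression term are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a coarse-to-fine contrastive objective for NR-VQA.
import torch
import torch.nn.functional as F

def group_contrast_loss(features, bitrate_ids, temperature=0.1):
    """Coarse term: pull together clip embeddings from the same bitrate group,
    push apart embeddings from different bitrate groups (assumed formulation).

    features:    (N, D) clip embeddings
    bitrate_ids: (N,) integer bitrate-group labels
    """
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / temperature                        # (N, N) similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))      # exclude self-pairs
    pos = (bitrate_ids.unsqueeze(0) == bitrate_ids.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos, 0.0)
    pos_count = pos.sum(1).clamp(min=1)
    # average log-likelihood of same-bitrate positives per anchor
    return -(pos_log_prob.sum(1) / pos_count).mean()

def pairwise_rank_loss(pred_scores, mos, margin=0.0):
    """Fine term: predicted quality order should follow the MOS order."""
    d_pred = pred_scores.unsqueeze(0) - pred_scores.unsqueeze(1)
    d_mos = mos.unsqueeze(0) - mos.unsqueeze(1)
    sign = torch.sign(d_mos)                              # ground-truth ordering
    return F.relu(margin - sign * d_pred).mean()

def total_loss(pred, mos, feats, bitrate_ids, w_group=1.0, w_rank=1.0):
    # Assumed overall objective: score regression + coarse + fine terms.
    return (F.l1_loss(pred, mos)
            + w_group * group_contrast_loss(feats, bitrate_ids)
            + w_rank * pairwise_rank_loss(pred, mos))
```

Under this reading, the group contrast term only needs bitrate labels (cheap to obtain for compressed videos), while the rank term uses the annotated MOS to refine ordering within and across bitrate groups.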