Abstract: The limited scale of data in low-level vision often exposes restoration networks to overfitting, which motivates the adoption of the pre-training paradigm. Mirroring the success of high-level pre-training approaches, recent methods in the low-level community aim to derive general visual representations from extensive data with synthesized degradation. In this paper, we propose a new perspective beyond the data-driven image pre-training paradigm for low-level vision, building upon the following observations. First, unlike the semantic extraction prevalent in high-level vision tasks, low-level vision primarily focuses on continuous, content-agnostic pixel-level regression, indicating that the diverse image content inherent in large-scale data is potentially unnecessary for low-level vision pre-training. Second, considering that low-level degradations are highly relevant to the frequency spectrum, we find that the low-level pre-training paradigm can be implemented in Fourier space with improved sensitivity to degradation. Therefore, we develop an Image-Free Pre-training (IFP) paradigm, a novel low-level pre-training approach that requires only a single randomly sampled Gaussian noise image, streamlining the complicated data collection and synthesis procedure. The principle of IFP is to reconstruct the original Gaussian noise from a randomly perturbed counterpart with a partially masked spectrum band, which fosters robust spectrum representation extraction in response to unpredictable downstream degradations. Extensive experiments demonstrate the significant improvements brought by the IFP paradigm to various downstream tasks, such as a 1.31dB performance boost in low-light enhancement for Restormer, and improvements of 1.2dB in deblurring and 2.42dB in deraining for Uformer.
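The pretext task described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `make_ifp_pair`, the additive Gaussian perturbation, the radial (annular) shape of the masked band, and all parameter values are assumptions chosen only to show one way a (degraded input, reconstruction target) pair might be built from a single Gaussian noise image.

```python
import numpy as np


def make_ifp_pair(size=64, noise_std=0.1, band=(0.2, 0.5), seed=0):
    """Build one hypothetical (input, target) pair for image-free pre-training.

    Assumptions (not taken from the paper): the perturbation is additive
    Gaussian noise, and the masked spectrum band is an annulus in
    normalized radial frequency given by `band`.
    """
    rng = np.random.default_rng(seed)

    # Reconstruction target: a single randomly sampled Gaussian noise image.
    target = rng.standard_normal((size, size))

    # Random perturbation of the target (assumed additive Gaussian).
    perturbed = target + noise_std * rng.standard_normal((size, size))

    # Move to Fourier space, with the zero frequency shifted to the center.
    spectrum = np.fft.fftshift(np.fft.fft2(perturbed))

    # Normalized radial frequency of every spectrum coefficient, in [0, 1].
    freqs = np.fft.fftshift(np.fft.fftfreq(size))
    radius = np.sqrt(freqs[:, None] ** 2 + freqs[None, :] ** 2)
    radius = radius / radius.max()

    # Zero out the coefficients whose radius falls inside the chosen band.
    keep = ~((radius >= band[0]) & (radius < band[1]))
    masked_spectrum = spectrum * keep

    # Back to the spatial domain; the mask is symmetric in frequency,
    # so the imaginary part is only numerical round-off.
    degraded = np.fft.ifft2(np.fft.ifftshift(masked_spectrum)).real
    return degraded, target, keep


degraded, target, keep = make_ifp_pair()
```

A restoration network would then be trained to map `degraded` back to `target`, for example with a pixel-wise regression loss; the choice of loss and the way bands are sampled during training are likewise left open here.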
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: The paper presents a novel pre-training paradigm for low-level vision tasks called Image-Free Pre-training (IFP). It focuses on the ability to perform pixel-level regression without relying on actual image data, using only a randomly generated Gaussian noise image. This approach aims to reduce the model's sensitivity to frequency band changes, improving performance in downstream tasks like image deraining, deblurring, and low-light enhancement. While not directly addressing multimodal processing, IFP's methodology could influence it by providing a new way to pre-train models under data-constrained conditions, potentially applicable to multimodal contexts where visual data is limited or noisy.
Supplementary Material: zip
Submission Number: 316