Abstract: With the evolution of telemedicine, clean images, especially facial images, are crucial in areas such as symptom evaluation and cosmetic medicine. However, restoring indiscernible facial images that have undergone unknown (blind) degradation poses a formidable challenge for existing blind face restoration (BFR) techniques, because the problem is inherently ill-posed. A delicate balance between preserving local details and maintaining global geometry is therefore desired. Despite advances in convolutional neural networks (CNNs), transformers, and denoising diffusion models, their core operations, whether stacked convolutions, elaborately designed self-attention, or stochastic noise, tend to isolate either local or non-local features. To this end, a new candidate model termed Multi-Scale in Multi-Scale Mamba (M2Mamba) is proposed. It builds on a structured state space model-based network and includes three new components: a multi-scale learned fusion module (MSLFM), a multi-scale attention fusion module (MSAFM), and a multi-scale inspired Mamba (MS-Mamba). First, the MSLFM extracts image-level global guidance from inputs at different scales, preserving intuitive perception while enriching semantic understanding. Second, the MSAFM dynamically integrates features from various encoder stages. Third, the MS-Mamba employs separate branches for small- and large-scale receptive fields, which benefits the modeling of long-range dependencies. Finally, M2Mamba is evaluated on both synthetic and real-world benchmarks, showing performance comparable to or better than state-of-the-art methods. The code is available at: https://github.com/yanwd628/M2Mamba.