Abstract: Unsupervised remote sensing dehazing remains a challenging and ill-posed task due to the absence of reliable supervision signals. Existing dehazing methods with unpaired data often oversimplify haze removal as style transfer, limiting generalization in complex scenarios.
Moreover, current unimodal frameworks neglect cross-modal cues that could improve contextual reasoning. To address these issues, we propose a novel cross-modal guided self-supervised dehazing framework called CLIP-HNet, which achieves multi-model feature extraction, boundary-focused reconstruction, and adaptive sample filtering. Specifically, to capture global-local contextual features, a hybrid feature interaction network is designed, which bridges the feature representations of multiple models with a global context-aware module (GCAM) and a hybrid feature fusion module (HF2M). Then, based on the hybrid features, a boundary-aware feature reconstruction (BFRec) is proposed to further refine edge details. Furthermore, a CLIP-guided progressive information distillation scheme is presented to dynamically prioritize training samples and distill useful signals; it predicts haze concentration with CLIP and progressively increases sample difficulty during the training stage. Finally, a frequency-domain texture matching (FTM) strategy refines texture and spectral details, enhancing the model’s ability to recover fine details. Experiments on synthetic and real remote sensing images (RSIs) demonstrate that the proposed CLIP-HNet surpasses state-of-the-art approaches, achieving superior visual quality and quantitative performance.