CLIP-HNet: Hybrid Network with Cross-Modal Guidance for Self-Supervised Remote Sensing Dehazing

Shan Wang

Published: 26 Oct 2025, Last Modified: 29 Jan 2026, OpenReview Archive Direct Upload, Everyone, CC BY 4.0

Abstract: Unsupervised remote sensing dehazing remains a challenging and ill-posed task due to the absence of reliable supervision signals. Existing dehazing methods with unpaired data often oversimplify haze removal as style transfer, limiting generalization in complex scenarios. Moreover, current unimodal frameworks neglect cross-modal cues that could improve contextual reasoning. To address these issues, we propose a novel cross-modal guided self-supervised dehazing framework called CLIP-HNet, which achieves multi-model feature extraction, boundary-focused reconstruction, and adaptive sample filtering. Specifically, to capture global-local contextual features, a hybrid feature interaction network is designed, which bridges the feature representations of multiple models with a global context-aware module (GCAM) and a hybrid feature fusion module (HF2M). Then, based on the hybrid features, boundary-aware feature reconstruction (BFRec) is proposed to further refine edge details. Furthermore, a CLIP-guided progressive information distillation scheme is presented to dynamically prioritize training samples and distill useful signals; it predicts haze concentration with CLIP and progressively increases sample difficulty during training. Finally, a frequency-domain texture matching (FTM) strategy refines texture and spectral details, enhancing the model’s ability to recover fine details. Experiments on synthetic and real remote sensing images (RSIs) demonstrate that the proposed CLIP-HNet surpasses state-of-the-art approaches, achieving superior visual quality and quantitative performance.
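The progressive sample prioritization mentioned in the abstract can be pictured with a minimal sketch. It assumes each training image already has a scalar haze-concentration score (the abstract says these come from CLIP, but the scores here are placeholder numbers), and it assumes a linear pacing schedule, which the abstract does not specify. The function name `curriculum_subset` and the parameter `start_frac` are hypothetical and are not taken from the paper.

```python
def curriculum_subset(haze_scores, epoch, total_epochs, start_frac=0.3):
    """Return indices of the samples admitted at `epoch`, least hazy first.

    haze_scores: one hypothetical haze-concentration score per sample
                 (higher means hazier, i.e. harder).
    start_frac:  assumed fraction of the dataset used at epoch 0.
    """
    if not haze_scores:
        return []
    # Assumed linear pacing: the admitted fraction grows from start_frac to 1.0.
    progress = min(1.0, epoch / max(1, total_epochs - 1))
    frac = start_frac + (1.0 - start_frac) * progress
    keep = max(1, round(frac * len(haze_scores)))
    # Sort sample indices from easiest (lowest score) to hardest.
    order = sorted(range(len(haze_scores)), key=lambda i: haze_scores[i])
    return order[:keep]


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.5, 0.3]  # placeholder scores, not real CLIP outputs
    for epoch in range(3):
        print(epoch, curriculum_subset(scores, epoch, total_epochs=3))
```

Under these assumptions, early epochs see only lightly hazed images and the hardest samples enter last; the paper's actual scheme may pace or weight samples differently.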