Abstract: Video dehazing is a critical research area in computer vision that aims to enhance the quality of hazy frames, which benefits many downstream tasks, e.g., semantic segmentation. Recent works devise CNN-based structures or attention mechanisms to fuse temporal information, while others exploit offsets between frames to align them explicitly. Another significant line of video dehazing research focuses on constructing paired datasets by synthesizing fog on clear videos or generating real haze in indoor scenes. Despite the significant contributions of these dehazing networks and datasets to the advancement of video dehazing, current methods still suffer from spatial–temporal inconsistency and poor generalization. We address these issues by proposing a triplane smoothing module that explicitly exploits the spatial–temporal smoothness prior of the input video to generate temporally coherent dehazing results. We further devise a query-based decoder to extract haze-relevant information while implicitly aggregating temporal clues. To improve the generalization ability of our dehazing model, we leverage CLIP guidance, which provides a rich, high-level understanding of haze effects. Extensive experiments verify that our model generates spatial–temporally consistent dehazing results and produces pleasing results on real-world data.