Dense Local Consistency Loss for Video Semantic Segmentation

Published: 01 Jan 2023 · Last Modified: 26 Jul 2025 · IC-NIDC 2023 · CC BY-SA 4.0
Abstract: Existing image semantic segmentation models often suffer from temporal inconsistency between consecutive frames when processing continuous video input. While using optical flow or incorporating information from historical frames can alleviate this issue, the resulting increase in parameters and computational complexity is detrimental to real-time tasks. In contrast, we propose a dense local consistency loss, dubbed DLCL, which introduces spatially local semantic consistency constraints between consecutive frames for video semantic segmentation. During training, DLCL is computed from the cosine similarity of feature embeddings belonging to the same object in consecutive frames. DLCL is simple yet effective, is easily integrated into both single-frame and video semantic segmentation models, and improves the temporal consistency and segmentation accuracy of predicted frames without adding any parameters or computational overhead at inference time. We conduct experiments on the large-scale multi-scene video semantic segmentation dataset VSPW to demonstrate the effectiveness of our approach. The results consistently show performance improvements for both single-frame and video semantic segmentation models, validating the efficacy of our method.