Abstract: Video Shadow Detection (VSD) aims to detect shadow masks from a sequence of frames. Existing works suffer from inefficient temporal learning. Moreover, few works address the VSD problem by considering the characteristic (i.e., boundary) of shadows. Motivated by this, we propose a Timeline and Boundary Guided Diffusion (TBGDiff) network for VSD that jointly takes into account past-future temporal guidance and boundary information. In detail, we design a Dual Scale Aggregation (DSA) module for better temporal understanding by rethinking the affinity of the long-term and short-term frames in the clipped video. Next, we introduce Shadow Boundary Aware Attention (SBAA) to utilize edge contexts for capturing the characteristics of shadows. Moreover, we are the first to introduce the Diffusion model for VSD, in which we explore a Space-Time Encoded Embedding (STEE) to inject temporal guidance into the Diffusion process for shadow detection. Benefiting from these designs, our model captures not only the temporal information but also the shadow property. Extensive experiments show that our approach outperforms state-of-the-art methods, verifying the effectiveness of our components. We release the code at https://github.com/haipengzhou856/TBGDiff.
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: Video shadow detection is crucial for improving visual perception in multimedia processing. Shadows can significantly affect the quality and accuracy of various multimedia applications such as object recognition, tracking, and scene understanding. Recent single-image shadow detection works have achieved promising results, whereas the more challenging video setting remains underexplored.
This paper introduces Timeline and Boundary Guided Diffusion (TBGDiff), the first work to explore temporal guidance for the Diffusion Model in video shadow detection. In detail, we design a Dual Scale Aggregation (DSA) module for better temporal understanding by rethinking the affinity of the long-term and short-term frames in the clipped video. Next, we introduce Shadow Boundary Aware Attention (SBAA) to utilize edge contexts for capturing the characteristics of shadows. Moreover, we are the first to introduce the Diffusion Model for VSD, in which we explore a Space-Time Encoded Embedding (STEE) to inject temporal guidance into the Diffusion process for shadow detection. Extensive experiments show that our method achieves state-of-the-art performance.
We hope our method can offer insight to the multimedia community working on video shadow detection.
Supplementary Material: zip
Submission Number: 342