Exploring Rich and Efficient Spatial Temporal Interactions for Real-Time Video Salient Object Detection
Abstract: We have witnessed a growing interest in video salient
object detection (VSOD) techniques in today’s computer vision
applications. In contrast with temporal information (which is still
considered a rather unstable source thus far), spatial information
is more stable and ubiquitous, and thus it could influence our
vision system more. As a result, the current mainstream VSOD
approaches have inferred and obtained their saliency primarily
from the spatial perspective, still treating temporal information
as subordinate. Although the aforementioned methodology of
focusing on the spatial aspect is effective in achieving a numeric
performance gain, it still has two critical limitations. First, to
ensure the dominance of the spatial information, its temporal
counterpart remains inadequately used, even though in some complex
video scenes the temporal information may be the only
reliable data source and is thus critical for deriving the correct
VSOD. Second, both spatial and temporal saliency cues are often
computed independently in advance and then integrated later
on, while the interactions between them are omitted completely,
resulting in saliency cues with limited quality. To combat these
challenges, this paper advocates a novel spatiotemporal network,
where the key innovation is the design of its temporal unit.
Compared with other existing competitors (e.g., convLSTM),
the proposed temporal unit exhibits an extremely lightweight
design that does not degrade its strong ability to sense temporal
information. Furthermore, it enables temporal saliency cues to be
computed in interaction with their spatial counterparts, so that
each can improve the other, ultimately boosting the overall VSOD
performance. The proposed method is easy to implement
yet effective, achieving high-quality VSOD at 50 FPS in real-time
applications.