From Semantic Categories to Fixations: A Novel Weakly-supervised Visual-auditory Saliency Detection Approach
Abstract: Thanks to the rapid advances in deep learning techniques and the wide availability of large-scale training sets, the performance of video saliency detection models has been improving steadily and significantly. However, deep-learning-based visual-audio fixation prediction is still in its infancy. At present, only a few visual-audio sequences have been furnished with real fixations recorded in a real visual-audio environment. Hence, it would be neither efficient nor necessary to re-collect real fixations under the same visual-audio circumstances. To address this problem, this paper advocates a novel weakly-supervised approach that alleviates the demand for large-scale training sets in visual-audio model training. Using only video category tags, we propose selective class activation mapping (SCAM), which follows a coarse-to-fine strategy to select the most discriminative regions in the spatial-temporal-audio circumstance. Moreover, these regions exhibit high consistency with real human-eye fixations and can subsequently be employed as pseudo ground truths (GTs) to train a new spatial-temporal-audio (STA) network. Without resorting to any real fixations, our STA network achieves performance comparable to that of fully supervised models.
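To make the pseudo-GT idea concrete, below is a minimal, hypothetical sketch of vanilla class activation mapping, the building block that SCAM extends: given the last convolutional feature maps and the classifier weights for a predicted category tag, it produces a normalized spatial map that could serve as a pseudo fixation map. The function and tensor names are illustrative assumptions, not the authors' implementation, and the sketch omits SCAM's selective, coarse-to-fine, and audio-aware components.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features: torch.Tensor,
                         fc_weights: torch.Tensor,
                         class_idx: int) -> torch.Tensor:
    """Illustrative (not the paper's) CAM computation.

    features:   (C, H, W) feature maps from the last conv layer
    fc_weights: (num_classes, C) weights of the classification head
    class_idx:  index of the video's category tag
    Returns a (H, W) map normalized to [0, 1], usable as a pseudo GT.
    """
    # Weight each channel by its contribution to the target class score,
    # then sum over channels to obtain a spatial activation map.
    cam = torch.einsum('c,chw->hw', fc_weights[class_idx], features)
    cam = F.relu(cam)                        # keep positive evidence only
    cam = cam - cam.min()
    cam = cam / (cam.max() + 1e-8)           # normalize to [0, 1]
    return cam
```

In a weakly supervised pipeline of this kind, such maps would be generated per frame from the category-level classifier and then refined before being treated as pseudo fixation GTs for training the saliency network.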