Abstract: Weakly supervised semantic segmentation with image-level labels is important because it reduces the dependence on dense annotations. However, the task is challenging because it requires mapping high-level semantics to low-level features. In this work, we propose a three-step method to bridge this gap. First, we exploit the interpretability of deep neural networks to generate attention maps that carry class localization information by back-propagating gradients. Second, we apply an off-the-shelf object saliency detector with an iterative erasing strategy to obtain saliency maps that capture the spatial extent of objects. Third, we combine these two complementary maps to generate pseudo ground-truth images for training the segmentation network. With a model pre-trained on the MS-COCO dataset and a multi-scale fusion method, we obtain mIoU of 62.1% and 63.3% on the PASCAL VOC 2012 val and test sets, respectively, achieving new state-of-the-art results for weakly supervised semantic segmentation.
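
The abstract does not state how the attention and saliency maps are combined, so the following is only a minimal sketch of one plausible fusion rule, not the authors' procedure. It assumes per-class attention maps and a single saliency map, both already normalized to [0, 1]. The function name `make_pseudo_labels`, the thresholds `sal_thresh` and `att_thresh`, and the choice to mark disagreeing pixels with an ignore label are all hypothetical and chosen for illustration.

```python
import numpy as np

BACKGROUND = 0   # label for background pixels
IGNORE = 255     # label excluded from the loss (assumed convention)


def make_pseudo_labels(attention, saliency, present_classes,
                       sal_thresh=0.5, att_thresh=0.3):
    """Fuse attention and saliency maps into a pseudo ground-truth label map.

    attention:       (C, H, W) float array in [0, 1], one map per class.
    saliency:        (H, W) float array in [0, 1], class-agnostic.
    present_classes: class indices (0-based) given by the image-level labels.
    Returns an (H, W) uint8 map: 0 = background, c + 1 = class c, 255 = ignore.
    Thresholds and the fusion rule are illustrative assumptions.
    """
    present = np.asarray(present_classes)
    labels = np.full(saliency.shape, BACKGROUND, dtype=np.uint8)

    # Only classes named by the image-level labels may appear in the output.
    att = attention[present]                 # (K, H, W)
    best = att.argmax(axis=0)                # strongest present class per pixel
    confident = att.max(axis=0) >= att_thresh
    salient = saliency >= sal_thresh

    # Both cues agree on foreground: assign the strongest present class.
    fg = salient & confident
    labels[fg] = (present[best[fg]] + 1).astype(np.uint8)

    # Exactly one cue fires: the pixel is ambiguous, so leave it out of training.
    labels[salient ^ confident] = IGNORE
    return labels
```

Calling `make_pseudo_labels(attention, saliency, present_classes=[0, 2])` returns a label map in which pixels supported by both cues take a class label, pixels supported by neither are background, and the rest are ignored. Other rules, such as trusting saliency alone for single-class images, would be equally consistent with the description above.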