Self-Distillation on Conditional Spatial Activation Maps for ForeGround-BackGround Segmentation

Yeruru Asrar Ahmed; Anurag Mittal

Published: 09 Apr 2024, Last Modified: 23 Apr 2024, SynData4CV, CC BY 4.0

Keywords: Synthetic segmentation maps, Synthetic datasets, Foreground-Background Segmentation

TL;DR: We propose a single-stage encoder-decoder network that generates spatial conditioning activation maps for images, enabling the creation of synthetic ForeGround-BackGround segmentation datasets, by replicating the deeper, spatially conditioned layers of conditional GANs.

Abstract: Fine-grained image segmentation offers a simplified yet meaningful representation of an image, but obtaining such segmentation maps for training large-scale models demands considerable human effort and cost. Existing strategies aim to predict these maps with limited or no labelled training pairs. When only a few image-label pairs are available, Semi-Supervised Segmentation (SSS) with the student-teacher paradigm is employed. Without labels, neural networks are designed to extract intermediate activation masks for unsupervised learning, mostly confined to 2-class Foreground-Background (FG-BG) segmentation. Unsupervised FG-BG segmentation typically relies on intricately designed large-scale Generative Adversarial Networks (GANs) to generate intermediate activation maps. Conditional GANs are also utilised with spatial conditioning maps to generate FG-BG maps for conditional image generation, facilitating the creation of synthetic datasets. Moreover, transferring annotations to real-world data often requires another segmentation network trained in a weakly supervised manner. In contrast to these multi-step approaches, we introduce a simple yet effective single-step approach that directly produces superior conditional FG-BG maps for images using a reconstruction network. Our proposed encoder-decoder network reconstructs the original image from slightly noisy inputs and generates precise conditional attention maps. These conditional attention maps are created by emulating the behaviour of deeper generator layers in spatial conditioning GANs and are further refined using the student-teacher paradigm. Our approach stands out for its simplicity and efficiency compared to intricate multi-step methods or GAN-based designs.
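
The abstract does not spell out how the student-teacher refinement is computed, so the following is only a generic sketch of one common form of self-distillation, not the authors' method. It assumes a teacher whose weights are an exponential moving average (EMA) of the student's, a teacher activation map binarised into an FG-BG pseudo-label, and a binary cross-entropy loss on the student map. All function names, the 0.5 threshold, and the decay values are hypothetical choices made for illustration.

```python
import math


def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight a small step towards the student weight (assumed EMA teacher)."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def pseudo_label(teacher_mask, threshold=0.5):
    """Binarise the teacher's soft foreground map into a hard FG/BG target (assumed threshold)."""
    return [1.0 if p >= threshold else 0.0 for p in teacher_mask]


def distillation_loss(student_mask, teacher_mask, threshold=0.5, eps=1e-7):
    """Mean binary cross-entropy between the student map and the teacher's pseudo-label."""
    target = pseudo_label(teacher_mask, threshold)
    total = 0.0
    for p, y in zip(student_mask, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(target)


# Flattened 2x2 activation maps with values in [0, 1] (toy data, not from the paper).
teacher_map = [0.9, 0.8, 0.2, 0.1]
student_map = [0.7, 0.6, 0.4, 0.3]

loss = distillation_loss(student_map, teacher_map)

# Toy weight vectors standing in for network parameters.
new_teacher = ema_update([0.0, 1.0], [1.0, 0.0], decay=0.9)
```

In a sketch like this, the loss shrinks as the student map moves towards the teacher's hard FG-BG assignment, while the EMA step keeps the teacher a slowly changing average of past students.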

Supplementary Material: pdf

Submission Number: 39
