Abstract: Incrementally learning new semantic concepts while retaining existing knowledge is fundamental for several real-world applications. While the behavior of backbones of different sizes and architectural choices has been studied to design efficient, limited-size architectures for many non-incremental computer vision applications, only large convolutional and Vision Transformer (ViT) backbones have been explored for class-incremental semantic segmentation, without a fair comparison with respect to model size. In this work, we present a fair study of existing class-incremental semantic segmentation methods, focusing on model efficiency with respect to memory footprint. Moreover, we propose TILES (Transformer-based Incremental Learning for Expanding Segmenter), a novel approach that exploits the efficiency of small ViT backbones to offer an alternative solution when severe memory constraints apply. It expands the architecture with the increments, allowing new tasks to be learned while retaining old knowledge within a limited memory footprint. To tackle the background semantic shift, we apply adaptive losses specific to the incremental branches while balancing old and new knowledge. Furthermore, we exploit the confidence of each incremental task to propose an efficient branch-merging strategy. TILES achieves state-of-the-art results on challenging benchmarks using up to $14$ times fewer parameters.
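The abstract does not spell out the branch-merging rule. As a purely illustrative sketch (the function, tensor names, and merging rule below are hypothetical and not taken from the paper), one simple confidence-based merge keeps, at each pixel, the prediction of the branch whose softmax confidence is highest:

```python
import torch

def merge_branch_logits(branch_logits):
    """Illustrative confidence-based merge of per-branch segmentation logits.

    branch_logits: list of tensors, one per incremental branch, each of shape
    (B, C_t, H, W) with logits over that branch's own classes.
    Returns per-pixel predictions in a global label space, taking the branch
    whose top softmax probability is highest at each pixel.
    (Hypothetical sketch; the paper's actual merging strategy may differ.)
    """
    best_conf = None      # (B, H, W): highest confidence seen so far
    merged_pred = None    # (B, H, W): global class index of that prediction
    class_offset = 0      # maps branch-local labels to a global label space
    for logits in branch_logits:
        probs = torch.softmax(logits, dim=1)       # (B, C_t, H, W)
        conf, local_pred = probs.max(dim=1)        # per-pixel confidence and label
        global_pred = local_pred + class_offset
        if best_conf is None:
            best_conf, merged_pred = conf, global_pred
        else:
            take_new = conf > best_conf            # pixels where this branch is more confident
            best_conf = torch.where(take_new, conf, best_conf)
            merged_pred = torch.where(take_new, global_pred, merged_pred)
        class_offset += logits.shape[1]
    return merged_pred
```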
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Massimiliano_Mancini1
Submission Number: 4903